Supplementary Notebook S5: Peptide Identification Benchmarking on Simulated Datasets¶
- License: Creative Commons Attribution-NonCommercial 4.0 International License
- Version: 0.1
- Edit Log:
- 2025-11-28: Initial version of the notebook
Requirements:
- Completed method runs: 02-runCOPF.R, 03-runPeCorA.R, 04-runProteoForge.py
- Result files in ./data/Sim{1-4}/ directories
Data Information:
This notebook processes method results from four simulation scenarios to evaluate peptide identification performance—the ability to correctly identify which peptides are perturbed.
| Simulation | Focus | Input Files Pattern |
|---|---|---|
| Sim1 | Complete vs. Imputed | 2_{pattern}_{type}_{method}_ResultData.feather |
| Sim2 | Missingness Levels | 2_Pro{rate}_Pep{rate}_imputed_{method}_ResultData.feather |
| Sim3 | Perturbation Magnitude | 2_{low}_{high}_{method}_ResultData.feather |
| Sim4 | Experimental Complexity | 2_{N}Cond_{overlap}_{dir}Dir_{method}_ResultData.feather |
Purpose:
Benchmark COPF, PeCorA, and ProteoForge on peptide identification using ROC curves, PR curves, and Matthews Correlation Coefficient (MCC) across multiple p-value thresholds.
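For reference, MCC condenses all four confusion-matrix cells into a single score between -1 and 1; the interpretation scale used later in this notebook (0.3 Fair, 0.5 Moderate, 0.7 Strong, 0.9 Almost Perfect) is applied to this value. A minimal, self-contained example of the calculation with toy counts (illustration only, not project code):
import numpy as np
def mcc(tp, fp, tn, fn):
    """Matthews Correlation Coefficient from confusion-matrix counts."""
    num = tp * tn - fp * fn
    den = np.sqrt(float(tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return num / den if den > 0 else 0.0
# Toy counts: 80 TP, 20 FP, 890 TN, 10 FN
print(f"MCC = {mcc(80, 20, 890, 10):.3f}")  # ~0.827, i.e. "Strong" on the scale used below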
Setup¶
This section imports libraries, configures display settings, and defines paths for outputs.
Note: The HTML rendering of this notebook hides code cells by default. Click the "Code" buttons to expand them.
Libraries¶
import os
import sys
import warnings
warnings.filterwarnings("ignore")
import numpy as np # Numerical computing
import pandas as pd # Data manipulation
import seaborn as sns # R-like high-level plots
import matplotlib.pyplot as plt # Python's base plotting
sys.path.append('../')
# Utility imports for this analysis
from src import utils, plots
# Initialize the timer
startTime = utils.getTime()
Display Settings¶
The cell below configures pandas, matplotlib, and seaborn display options for improved readability of tables and figures, including color palettes and figure export settings.
## Figure Settings
# Define default colors and styles for plots
def_colors = [
"#139593", "#fca311", "#e54f2a",
"#c3c3c3", "#555555",
"#690000", "#5f4a00", "#004549"
]
# Set seaborn style
sns.set_theme(
style="white",
context="paper",
palette=def_colors,
font_scale=1,
rc={
"figure.figsize": (6, 4),
"font.family": "sans-serif",
"font.sans-serif": ["Arial", "Ubuntu Mono"],
}
)
# Figure Saving Settings
figure_formats = ["pdf"]
save_to_folder = True
transparent_bg = True
figure_dpi = 300
## Configure dataframe displaying
pd.set_option('display.float_format', lambda x: '%.4f' % x)
pd.set_option('display.max_rows', 50)
pd.set_option('display.max_columns', 50)
pd.set_option('display.width', 500) # Set a wider display width
## Printing Settings
verbose = True
## Show the color-palettes
plots.color_palette( def_colors, save=False )
Data and Result Paths¶
Defines input/output directory structure:
- Input:
- Input: ./data/Sim{N}/ — Method result files from each simulation
- Output: Processed performance data and figures saved to the same directories
# Establish base paths and the notebook name
notebook_name = "05-IdBenchmark"
input_path = f"./data/"
mainFig_path = f"./figures/"
def setup_simulation_paths( simID ):
"""
Create output and figure directories for a simulation, if they do not exist.
(Uses global variable for save_to_folder and figure_formats)
Args:
simID (str): Simulation ID.
(Global Variables)
input_path (str): Base path for input data.
mainFig_path (str): Base path for figures.
save_to_folder (bool): Whether to save figures to folders.
figure_formats (list): List of formats to save figures in.
Returns:
tuple: (output_path, figure_path)
"""
output_path = f"{input_path}{simID}/"
if not os.path.exists(output_path):
os.makedirs(output_path)
figure_path = f"{mainFig_path}{simID}/"
if save_to_folder:
for fmt in figure_formats:
cur_folder = os.path.join(figure_path, fmt)
if not os.path.exists(cur_folder):
os.makedirs(cur_folder)
return output_path, figure_path
Global Variables¶
Defines analysis constants and visualization styling:
- seed, pthr: Reproducibility seed and p-value threshold (10⁻³)
- thresholds: Range of p-value thresholds for curve generation
- method_palette, method_markers: Consistent styling for COPF, PeCorA, ProteoForge
- mcc_thresholds, mcc_colors: MCC interpretation scale (Random → Almost Perfect)
# Global variables
seed = 42 # Seed for reproducibility
pthr = 10**-3 # p-value threshold for significance
thresholds = list(utils.generate_thresholds(10.0, -15, 1, 0, 1, 0.1)) # Thresholds for the analysis
# Methods and their plotting styles
method_palette = {
"COPF": "#139593",
"PeCorA": "#fca311",
"ProteoForge": "#e54f2a",
}
method_styles = {
"COPF": "--",
"PeCorA": "-.",
"ProteoForge": ":",
}
method_markers = {
"COPF": "o",
"PeCorA": "s",
"ProteoForge": "^",
}
# Matthews Correlation Coefficient (MCC) thresholds and colors
mcc_thresholds = {
0.0 : 'Random',
0.3 : 'Fair (0.3)',
0.5 : 'Moderate (0.5)',
0.7 : 'Strong (0.7)',
0.9 : 'Almost Perfect (0.9)',
}
mcc_colors = {
'Random': '#c3c3c3',
'Fair (0.3)': '#e0aaff',
'Moderate (0.5)': '#9d4edd',
'Strong (0.7)': '#5a189a',
'Almost Perfect (0.9)': '#240046'
}
plots.color_palette(mcc_colors, save=False, name="MCC Interpretation Colors")
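The thresholds list above is built by the project helper utils.generate_thresholds(10.0, -15, 1, 0, 1, 0.1), whose implementation is not shown in this notebook. A minimal sketch of the assumed behavior — a log-spaced p-value sweep combined with a linear sweep, deduplicated — which reproduces the 25 unique thresholds (min 0.0, max 1.0) reported in the summaries below:
import numpy as np
def generate_thresholds_sketch(base, exp_start, exp_stop, lin_start, lin_stop, lin_step):
    """Hypothetical stand-in for utils.generate_thresholds (assumption, not project code):
    powers of `base` for exponents in [exp_start, exp_stop) plus a linear sweep from
    lin_start to lin_stop, deduplicated and sorted."""
    log_part = [base ** e for e in range(exp_start, exp_stop)]    # 1e-15 ... 1e0
    n_lin = int(round((lin_stop - lin_start) / lin_step)) + 1
    lin_part = list(np.linspace(lin_start, lin_stop, n_lin))      # 0.0, 0.1, ..., 1.0
    return sorted(set(log_part + lin_part))
sketch = generate_thresholds_sketch(10.0, -15, 1, 0, 1, 0.1)
print(len(sketch), min(sketch), max(sketch))  # 25 unique thresholds spanning 0.0-1.0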
Simulation 1: Complete vs. Imputed Data Comparison¶
Objective: Compare peptide identification performance between complete (fully quantified) and imputed datasets across different perturbation patterns.
Input Data: ./data/Sim1/2_{pattern}_{complete|imputed}_{method}_ResultData.feather
Perturbation Patterns:
- twoPep: 2 peptides perturbed per protein
- randomPep: Random 10–50% of peptides perturbed
- halfPep: 50% of peptides perturbed
- halfPlusPep: 70% of peptides perturbed
Metrics Generated:
- ROC curves (FPR vs TPR) with AUC
- PR curves (Recall vs Precision) with AUC
- MCC across p-value thresholds
Key Question: Does imputation affect method performance compared to complete data?
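The per-threshold confusion-matrix metrics used throughout this notebook come from the project helper utils.create_metric_data, which is not reproduced here. A minimal sketch of the assumed computation (a hypothetical helper; column names follow the preview tables shown below):
import numpy as np
import pandas as pd
def metric_sketch(df, thresholds, label_col="pertPeptide", pvalue_col="adj_pval"):
    """Hypothetical stand-in for utils.create_metric_data (assumption, not project code):
    at each p-value threshold, call a peptide perturbed when its adjusted p-value is
    below the threshold and score the call against the ground-truth label column."""
    y_true = df[label_col].astype(bool).to_numpy()
    pvals = df[pvalue_col].to_numpy()
    rows = []
    for thr in thresholds:
        y_pred = pvals < thr
        tp = int(np.sum(y_pred & y_true))
        fp = int(np.sum(y_pred & ~y_true))
        tn = int(np.sum(~y_pred & ~y_true))
        fn = int(np.sum(~y_pred & y_true))
        prec = tp / (tp + fp) if (tp + fp) else 0.0
        rec = tp / (tp + fn) if (tp + fn) else 0.0
        mcc_den = np.sqrt(float(tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
        rows.append({
            "TP": tp, "FP": fp, "TN": tn, "FN": fn,
            "TPR": rec,
            "FPR": fp / (fp + tn) if (fp + tn) else 0.0,
            "FDR": fp / (tp + fp) if (tp + fp) else 0.0,
            "MCC": (tp * tn - fp * fn) / mcc_den if mcc_den else 0.0,
            "Precision": prec, "Recall": rec,
            "F1": 2 * prec * rec / (prec + rec) if (prec + rec) else 0.0,
            "threshold": thr,
        })
    return pd.DataFrame(rows)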
Data Processing and Metric Calculation¶
stTime = utils.getTime()
simID = "Sim1"
# Set up a path for the simulation
output_path, figure_path = setup_simulation_paths( simID )
methods = list(method_palette.keys())
dataTypes = ['complete', 'imputed']
experiments = ['twoPep', 'randomPep', 'halfPep', 'halfPlusPep']
experiment_mapper = {
'twoPep': 'Two Peptides',
'randomPep': '2>50% Peptides',
'halfPep': '50% Peptides',
'halfPlusPep': '>50% Peptides',
}
# =============================================================================
# ANALYSIS SETUP
# =============================================================================
print("PEPTIDE IDENTIFICATION BENCHMARK ANALYSIS")
print("-" * 52)
print(f"Simulation ID: {simID}")
print(f"Methods: {', '.join(methods)} ({len(methods)} total)")
print(f"Data Types: {', '.join(dataTypes)}")
print(f"Experiments: {len(experiments)} scenarios")
print(f"Output Path: {output_path}")
# Progress tracking
total_combinations = len(methods) * len(dataTypes) * len(experiments)
current_combination = 0
files_found = 0
files_missing = 0
print(f"Total combinations to process: {total_combinations}")
print("-" * 52)
# =============================================================================
# DATA PROCESSING
# =============================================================================
display("Calculating ID Benchmarks:")
combined_results = []
for cur_method in methods:
for cur_type in dataTypes:
for cur_exp in experiments:
current_combination += 1
# Load the result data
result_file = f"{output_path}2_{cur_exp}_{cur_type}_{cur_method}_ResultData.feather"
if os.path.exists(result_file):
res_df = pd.read_feather(result_file)
files_found += 1
# Show progress for successful loads only
if verbose:
progress = f"[{current_combination:2d}/{total_combinations}]"
print(f"{progress} ✓ {cur_method:12} | {cur_type:9} | {experiment_mapper[cur_exp]}")
else:
files_missing += 1
if verbose:
progress = f"[{current_combination:2d}/{total_combinations}]"
print(f"{progress} ✗ {cur_method:12} | {cur_type:9} | {experiment_mapper[cur_exp]} (MISSING)")
continue
# Handle COPF-specific column renaming
if cur_method == "COPF":
res_df = res_df.rename(columns={
"proteoform_score_pval": "adj_pval",
'protein_id': "Protein"
})
# Create metric data
metric_data = utils.create_metric_data(
res_df,
pvalue_thresholds=thresholds,
label_col='pertPeptide',
pvalue_col='adj_pval'
)
metric_data['Method'] = cur_method
metric_data['DataType'] = cur_type
metric_data['Experiment'] = cur_exp
combined_results.append(metric_data)
# Combine and process data
idBenchmarkData = pd.concat(combined_results, ignore_index=True)
idBenchmarkData['Experiment'] = idBenchmarkData['Experiment'].map(experiment_mapper)
# Enforce categorical ordering: twoPep, randomPep, halfPep, halfPlusPep
idBenchmarkData['Experiment'] = pd.Categorical(
idBenchmarkData['Experiment'],
categories=[experiment_mapper[exp] for exp in experiments],
ordered=True
)
print(f"\nDATA PREVIEW:")
display(idBenchmarkData.tail(5))
# Save the processed data
output_file = f"{output_path}4_{simID}_PeptideIdentification_PerformanceData.feather"
idBenchmarkData.to_feather(output_file)
idBenchmarkData.to_csv(output_file.replace('.feather', '.csv'), index=False)
print(f"\nRESULTS SUMMARY:")
print("=" * 52)
execTime = utils.prettyTimer(utils.getTime() - stTime)
print(f"Total execution time: {execTime}")
print(f"Files processed successfully: {files_found}")
print(f"Files missing/skipped: {files_missing}")
print(f"Processing success rate: {files_found/(files_found+files_missing)*100:.1f}%")
print("-" * 52)
print(f"Final dataset shape: {idBenchmarkData.shape}")
print(f"Unique experiments: {idBenchmarkData['Experiment'].nunique()}")
print(f"Methods analyzed: {idBenchmarkData['Method'].nunique()}")
print(f"Data types included: {idBenchmarkData['DataType'].nunique()}")
print(f"Data saved to: {output_file}")
print("=" * 52)
PEPTIDE IDENTIFICATION BENCHMARK ANALYSIS ---------------------------------------------------- Simulation ID: Sim1 Methods: COPF, PeCorA, ProteoForge (3 total) Data Types: complete, imputed Experiments: 4 scenarios Output Path: ./data/Sim1/ Total combinations to process: 24 ----------------------------------------------------
'Calculating ID Benchmarks:'
[ 1/24] ✓ COPF | complete | Two Peptides [ 2/24] ✓ COPF | complete | 2>50% Peptides [ 3/24] ✓ COPF | complete | 50% Peptides [ 4/24] ✓ COPF | complete | >50% Peptides [ 5/24] ✓ COPF | imputed | Two Peptides [ 6/24] ✓ COPF | imputed | 2>50% Peptides [ 7/24] ✓ COPF | imputed | 50% Peptides [ 8/24] ✓ COPF | imputed | >50% Peptides [ 9/24] ✓ PeCorA | complete | Two Peptides [10/24] ✓ PeCorA | complete | 2>50% Peptides [11/24] ✓ PeCorA | complete | 50% Peptides [12/24] ✓ PeCorA | complete | >50% Peptides [13/24] ✓ PeCorA | imputed | Two Peptides [14/24] ✓ PeCorA | imputed | 2>50% Peptides [15/24] ✓ PeCorA | imputed | 50% Peptides [16/24] ✓ PeCorA | imputed | >50% Peptides [17/24] ✓ ProteoForge | complete | Two Peptides [18/24] ✓ ProteoForge | complete | 2>50% Peptides [19/24] ✓ ProteoForge | complete | 50% Peptides [20/24] ✓ ProteoForge | complete | >50% Peptides [21/24] ✓ ProteoForge | imputed | Two Peptides [22/24] ✓ ProteoForge | imputed | 2>50% Peptides [23/24] ✓ ProteoForge | imputed | 50% Peptides [24/24] ✓ ProteoForge | imputed | >50% Peptides DATA PREVIEW:
| | TP | FP | TN | FN | TPR | FPR | FDR | MCC | Precision | Recall | F1 | threshold | Method | DataType | Experiment |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 595 | 825 | 991 | 2488 | 1379 | 0.3743 | 0.2849 | 0.5457 | 0.0935 | 0.4543 | 0.3743 | 0.4104 | 0.6000 | ProteoForge | imputed | >50% Peptides |
| 596 | 852 | 1014 | 2465 | 1352 | 0.3866 | 0.2915 | 0.5434 | 0.0987 | 0.4566 | 0.3866 | 0.4187 | 0.7000 | ProteoForge | imputed | >50% Peptides |
| 597 | 871 | 1038 | 2441 | 1333 | 0.3952 | 0.2984 | 0.5437 | 0.0999 | 0.4563 | 0.3952 | 0.4235 | 0.8000 | ProteoForge | imputed | >50% Peptides |
| 598 | 891 | 1050 | 2429 | 1313 | 0.4043 | 0.3018 | 0.5410 | 0.1053 | 0.4590 | 0.4043 | 0.4299 | 0.9000 | ProteoForge | imputed | >50% Peptides |
| 599 | 2204 | 3479 | 0 | 0 | 1.0000 | 1.0000 | 0.6122 | 0.0000 | 0.3878 | 1.0000 | 0.5589 | 1.0000 | ProteoForge | imputed | >50% Peptides |
RESULTS SUMMARY: ==================================================== Total execution time: 00h:00m:03s Files processed successfully: 24 Files missing/skipped: 0 Processing success rate: 100.0% ---------------------------------------------------- Final dataset shape: (600, 15) Unique experiments: 4 Methods analyzed: 3 Data types included: 2 Data saved to: ./data/Sim1/4_Sim1_PeptideIdentification_PerformanceData.feather ====================================================
ROC Curves: Discrimination Ability¶
ROC curves show the trade-off between true positive rate (sensitivity) and false positive rate (1-specificity). AUC values indicate overall discrimination ability—higher is better. Black circles mark performance at the significance threshold (p < 10⁻³).
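The AUC values annotated on each panel are obtained with the trapezoidal rule over the (FPR, TPR) points, after utils.complete_curve_data (assumed here to anchor each curve at (0, 0) and (1, 1)). A toy example of the calculation, independent of the benchmark data:
import numpy as np
# Toy ROC points sorted by FPR, with (0, 0) and (1, 1) endpoints included
fpr = np.array([0.0, 0.1, 0.3, 1.0])
tpr = np.array([0.0, 0.6, 0.8, 1.0])
auroc = np.trapezoid(tpr, fpr)   # numpy >= 2.0; older versions expose this as np.trapz
print(f"AUROC = {auroc:.3f}")    # 0.800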
# ROC Curves for different perturbations and methods
fig, axes = plt.subplots(
2, 4, figsize=(16, 6),
sharey=True, sharex=True,
gridspec_kw={
"wspace": 0.05,
"hspace": 0.05
}
)
for i, pert in enumerate(idBenchmarkData['Experiment'].unique()):
for j, dataType in enumerate(dataTypes):
cur_data = idBenchmarkData[
(idBenchmarkData["Experiment"] == pert) &
(idBenchmarkData["DataType"] == dataType)
]
# Ensure the data is complete for ROC curves
cur_data = cur_data.groupby("Method").apply(
lambda x: utils.complete_curve_data(x, 'ROC', 'FPR', 'TPR')
).reset_index(drop=True)
# Calculate AUC per method from TPR and FPR
auc_data = cur_data.groupby("Method").apply(
lambda x: np.trapezoid(
x.sort_values("FPR")["TPR"], x.sort_values("FPR")["FPR"]
),
include_groups=False
)
# Plot the ROC curve
sns.lineplot(
data=cur_data,
x="FPR",
y="TPR",
hue="Method",
style="Method",
palette=method_palette,
ax=axes[j, i],
# Show the points
markers=method_markers,
dashes=False,
markersize=7.5,
markeredgewidth=0.5,
markeredgecolor="gray",
legend=False,
rasterized=True,
estimator=None
)
# Add AUC values as legend like text
for k, method in enumerate(auc_data.index):
auc = auc_data[method]
# Add color to match the palette
color = method_palette[method]
axes[j, i].text(
0.95,
0.05 + k * 0.075,
f"{method}: {auc:.2f}",
color=color,
transform=axes[j, i].transAxes,
ha="right",
va="bottom",
fontsize=12,
fontweight="bold",
)
# Add a large no-facecolor marker to the pthr value for each method
pthr_data = cur_data[(cur_data["Method"] == method) & (cur_data["threshold"] == pthr)]
axes[j, i].scatter(
pthr_data["FPR"],
pthr_data["TPR"],
color=color,
s=125,
edgecolor="black",
linewidth=1.25,
marker="o",
facecolors="none",
)
# Add the diagonal line
axes[j, i].plot([0, 1], [0, 1], color="black", linestyle="--")
# axes[j, i].set_title(f"{pert}", fontsize=14, fontweight="bold")
axes[j, i].set_xlabel("FPR", fontsize=12)
axes[j, i].set_ylabel("TPR", fontsize=12)
axes[j, i].set_xlim(-0.05, 1.05)
axes[j, i].set_ylim(-0.05, 1.05)
axes[j, i].grid("both", linestyle="--", linewidth=0.75, alpha=0.5, color="lightgrey")
# If top row set the title
if j == 0:
axes[j, i].set_title(f"{pert}", fontsize=14, fontweight="bold")
# For the last column, set the text for the dataType on the right
if i == 3:
axes[j, i].text(
1.05,
0.5,
f"{dataType.capitalize()}",
transform=axes[j, i].transAxes,
ha="left",
va="center",
fontsize=14,
fontweight="bold",
rotation=270,
)
# Add a text indicating the circle is the p-value threshold under fig.title and above the first subplot
fig.text(
0.9,
.97,
f"Black circles represent the TPR and FPR values at p-value threshold of {pthr}",
ha="right",
va="center",
fontsize=12,
fontstyle="italic",
color="gray",
)
# Finalize the plot
plt.tight_layout()
plots.finalize_plot(
fig,
show=True,
save=save_to_folder,
filename=f"{simID}_PepIDBenchmark_ROC_Curve",
filepath=figure_path,
formats=figure_formats,
transparent=transparent_bg,
dpi=figure_dpi,
)
Figure saved to: ./figures/Sim1//pdf/Sim1_PepIDBenchmark_ROC_Curve.pdf
Precision-Recall Curves: Performance Under Class Imbalance¶
PR curves are particularly informative when positive cases (perturbed peptides) are rare. Higher AUC-PR indicates better precision at various recall levels. Black circles mark the significance threshold.
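A useful reference when reading these panels: the no-skill baseline of a PR curve equals the prevalence of true positives, so AUC-PR should be judged against that prevalence rather than against 0.5. A small illustration using the counts from the Sim1 preview table above (imputed, ">50% Peptides", threshold 1.0 row):
# Prevalence = positives / total peptides
tp_plus_fn = 2204      # all truly perturbed peptides
tn_plus_fp = 3479      # all truly unperturbed peptides
baseline_precision = tp_plus_fn / (tp_plus_fn + tn_plus_fp)
print(f"No-skill PR baseline = {baseline_precision:.3f}")   # 0.388, matching Precision at threshold 1.0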
# PR Curves for different perturbations and methods
fig, axes = plt.subplots(
2, 4, figsize=(16, 6),
sharey=True, sharex=True,
gridspec_kw={
"wspace": 0.05,
"hspace": 0.05
}
)
for i, pert in enumerate(idBenchmarkData['Experiment'].unique()):
for j, dataType in enumerate(dataTypes):
cur_data = idBenchmarkData[
(idBenchmarkData["Experiment"] == pert) &
(idBenchmarkData["DataType"] == dataType)
]
# Add a small value to avoid division by zero
cur_data["Precision"] = cur_data["TP"] / (cur_data["TP"] + cur_data["FP"] + 1e-6)
cur_data["Recall"] = cur_data["TP"] / (cur_data["TP"] + cur_data["FN"] + 1e-6)
cur_data = cur_data.sort_values("Recall", ascending=False)
cur_data = cur_data[~((cur_data["Recall"] == 0) & (cur_data["Precision"] == 0))]
# Ensure the data is complete for PR curves
cur_data = cur_data.groupby("Method").apply(
lambda x: utils.complete_curve_data(x, 'PR', 'Recall', 'Precision')
).reset_index(drop=True)
# Calculate AUC per method from Precision and Recall
# Here we use the trapezoidal rule to calculate the area under the curve
f1_data = cur_data.groupby("Method").apply(
lambda x: np.trapezoid(
x.sort_values("Recall")["Precision"], x.sort_values("Recall")["Recall"]
),
include_groups=False
)
# Plot the PR curve
sns.lineplot(
data=cur_data,
x="Recall",
y="Precision",
hue="Method",
style="Method",
palette=method_palette,
ax=axes[j, i],
# Show the points
markers=method_markers,
dashes=False,
markersize=7.5,
markeredgewidth=0.5,
markeredgecolor="gray",
legend=False,
rasterized=True,
estimator=None
)
# Add AUC values as legend like text
for k, method in enumerate(f1_data.index):
f1 = f1_data[method]
# Add color to match the palette
color = method_palette[method]
axes[j, i].text(
0.05,
0.05 + k * 0.075,
f"{method}: {f1:.2f}",
color=color,
transform=axes[j, i].transAxes,
ha="left",
va="bottom",
fontsize=12,
fontweight="bold",
)
# Add a large no-facecolor marker to the pthr value for each method
pthr_data = cur_data[(cur_data["Method"] == method) & (cur_data["threshold"] == pthr)]
axes[j, i].scatter(
pthr_data["Recall"],
pthr_data["Precision"],
color=color,
s=125,
edgecolor="black",
linewidth=1.25,
marker="o",
facecolors="none",
)
axes[j, i].set_xlabel("Recall", fontsize=12)
axes[j, i].set_ylabel("Precision", fontsize=12)
axes[j, i].set_xlim(-0.05, 1.05)
axes[j, i].set_ylim(-0.05, 1.05)
axes[j, i].grid("both", linestyle="--", linewidth=0.75, alpha=0.5, color="lightgrey")
# If top row set the title
if j == 0:
axes[j, i].set_title(f"{pert}", fontsize=14, fontweight="bold")
# For the last column, set the text for the dataType on the right
if i == 3:
axes[j, i].text(
1.05,
0.5,
f"{dataType.capitalize()}",
transform=axes[j, i].transAxes,
ha="left",
va="center",
fontsize=14,
fontweight="bold",
rotation=270,
)
# Add a text indicating the circle is the p-value threshold under fig.title and above the first subplot
fig.text(
0.9,
.97,
f"Black circles represent the Precision and Recall values at p-value threshold of {pthr}",
ha="right",
va="center",
fontsize=12,
fontstyle="italic",
color="gray",
)
# Finalize the plot
plt.tight_layout()
plots.finalize_plot(
fig,
show=True,
save=save_to_folder,
filename=f"{simID}_PepIDBenchmark_PR_Curve",
filepath=figure_path,
formats=figure_formats,
transparent=transparent_bg,
dpi=figure_dpi,
)
Figure saved to: ./figures/Sim1//pdf/Sim1_PepIDBenchmark_PR_Curve.pdf
MCC Across Thresholds: Optimal Threshold Selection¶
Matthews Correlation Coefficient balances all confusion matrix elements. The curves show how MCC varies with p-value threshold. Black circles mark maximum MCC achieved. Horizontal lines indicate MCC interpretation levels (Random → Almost Perfect).
best_thresholds = idBenchmarkData.groupby(["Method", "Experiment", "DataType"]).apply(
lambda x: x.loc[x["MCC"].idxmax(), ["threshold", "MCC"]]
).reset_index()
best_thresholds.columns = ["Method", "Experiment", "DataType", "threshold", "MCC"]
# print("\nBest thresholds per method and experiment:")
# display(best_thresholds)
# MCC Curves for p-value thresholds
fig, axes = plt.subplots(
2, 4, figsize=(16, 6),
sharey=True, sharex=True,
gridspec_kw={
"wspace": 0.05,
"hspace": 0.05
}
)
for i, pert in enumerate(idBenchmarkData['Experiment'].unique()):
for j, dataType in enumerate(dataTypes):
cur_data = idBenchmarkData[
(idBenchmarkData["Experiment"] == pert) &
(idBenchmarkData["DataType"] == dataType)
].copy()
cur_data['-log10(threshold)'] = -np.log10(cur_data['threshold'])
# Plot the MCC curve
sns.lineplot(
data=cur_data,
x="-log10(threshold)",
y="MCC",
hue="Method",
style="Method",
palette=method_palette,
ax=axes[j, i],
# Show the points
markers=method_markers,
dashes=False,
markersize=7.5,
markeredgewidth=0.5,
markeredgecolor="gray",
legend=False,
rasterized=True,
estimator=None
)
# Add the best MCC point for each method
for k, method in enumerate(cur_data["Method"].unique()):
best_data = best_thresholds[
(best_thresholds["Method"] == method) &
(best_thresholds["Experiment"] == pert) &
(best_thresholds["DataType"] == dataType)
]
if not best_data.empty:
best_threshold = best_data["threshold"].values[0]
best_mcc = best_data["MCC"].values[0]
axes[j, i].scatter(
-np.log10(best_threshold),
best_mcc,
color=method_palette[method],
s=125,
edgecolor="black",
linewidth=1.25,
marker="o",
facecolors="none",
)
# # Annotate the best MCC point
# Add color to match the palette
color = method_palette[method]
axes[j, i].text(
0.95,
0.75 + k * 0.075,
f"{method}: {best_mcc:.2f}",
color=color,
ha="right",
va="bottom",
fontsize=12,
fontweight="bold",
transform=axes[j, i].transAxes,
)
axes[j, i].set_xlabel("-log10(p-value threshold)", fontsize=12)
axes[j, i].set_ylabel("MCC", fontsize=12)
axes[j, i].set_xlim(-0.5, 15.5)
axes[j, i].set_ylim(-0.25, 1.05)
axes[j, i].grid("both", linestyle="--", linewidth=0.75, alpha=0.5, color="lightgrey")
# Add horizontal line at y=0
# axes[j, i].axhline(0, color="black", linestyle="--", linewidth=1)
# Draw MCC interpretation thresholds
for thresh, label in mcc_thresholds.items():
axes[j, i].axhline(
thresh, color=mcc_colors[label], alpha=1,
linestyle="dotted", linewidth=1.5,
label=label, zorder=0
)
# axes[j, i].text(
# 0.01,
# thresh + 0.02,
# label,
# color=mcc_colors[label],
# ha="left",
# va="bottom",
# fontsize=10,
# fontweight="bold",
# transform=axes[j, i].transAxes,
# )
# If top row set the title
if j == 0:
axes[j, i].set_title(f"{pert}", fontsize=14, fontweight="bold")
# For the last column, set the text for the dataType on the right
if i == 3:
axes[j, i].text(
1.05,
0.5,
f"{dataType.capitalize()}",
transform=axes[j, i].transAxes,
ha="left",
va="center",
fontsize=14,
fontweight="bold",
rotation=270,
)
# Add a text indicating the circle is the best MCC point under fig.title and above the first subplot
fig.text(
0.9,
.97,
f"Black circles represent the maximum MCC values per method",
ha="right",
va="center",
fontsize=12,
fontstyle="italic",
color="gray",
)
# Finalize the plot
plt.tight_layout()
plots.finalize_plot(
fig,
show=True,
save=save_to_folder,
filename=f"{simID}_PepIDBenchmark_MCC_Curve",
filepath=figure_path,
formats=figure_formats,
transparent=transparent_bg,
dpi=figure_dpi,
)
Figure saved to: ./figures/Sim1//pdf/Sim1_PepIDBenchmark_MCC_Curve.pdf
MCC Summary: Barplot Comparison¶
Barplots compare mean MCC (with 95% CI) across methods and perturbation patterns. Scatter points show MCC at the specific p-value threshold (10⁻³). Values below bars indicate mean MCC scores.
# # MCC Score for different perturbations and methods at the best threshold
# Create subplots for MCC Score for different perturbations and methods at the best threshold
fig, axes = plt.subplots(
1, 2, figsize=(12, 5),
sharey=True, sharex=True,
gridspec_kw={"wspace": 0.05, "hspace": 0.1}
)
for i, dataType in enumerate(dataTypes):
cur_data = idBenchmarkData[idBenchmarkData["DataType"] == dataType]
sns.barplot(
ax=axes[i],
data=cur_data,
x="Experiment",
y="MCC",
hue="Method",
edgecolor="black",
linewidth=0.8,
errorbar="ci", # Show confidence intervals across all thresholds
capsize=0.05,
err_kws={'linewidth': 1.5}
)
# Write the average MCC score for each method on the bottom of each bar
for j, pert in enumerate(idBenchmarkData['Experiment'].unique()):
for k, method in enumerate(method_palette.keys()): # Changed from hardcoded list
cur_score = cur_data[
(cur_data["Experiment"] == pert) &
(cur_data["Method"] == method)
]["MCC"].mean()
axes[i].text(
j + k * 0.25 - 0.25,
0 - 0.05,
f"{cur_score:.2f}",
color=method_palette[method],
ha="center",
va="top",
fontsize=10,
fontweight="bold",
rotation=90,
)
# Get data specifically at the p-value threshold for point overlay
pthr_data = cur_data[cur_data['threshold'] == pthr]
# Add points for specific p-value threshold
for j, method in enumerate(method_palette.keys()):
method_data = pthr_data[pthr_data['Method'] == method]
# Get x positions for this method's bars
xlist = sorted(cur_data['Experiment'].unique())
x_positions = []
y_positions = []
for k, x in enumerate(experiment_mapper.values()):
mcc_at_pthr = method_data.loc[method_data['Experiment'] == x, 'MCC']
if not mcc_at_pthr.empty:
# Calculate x position (center of bar for this method)
x_pos = k + (j - 1) * 0.27 # Adjust based on seaborn's bar positioning
x_positions.append(x_pos)
y_positions.append(mcc_at_pthr.iloc[0])
# Add scatter points for this method at pthr
axes[i].scatter(
x_positions,
y_positions,
color=method_palette[method],
s=80,
edgecolor='black',
linewidth=1.25,
marker='o',
zorder=10,
alpha=0.9
)
# Draw MCC interpretation thresholds
for thresh, label in mcc_thresholds.items():
axes[i].axhline(
thresh, color=mcc_colors[label], alpha=1,
linestyle="dotted", linewidth=1.5,
label=label, zorder=0
)
axes[i].set_title(f"{dataType.capitalize()}", fontsize=14, fontweight="bold")
axes[i].set_xlabel('')
axes[i].set_ylim(-0.15, 1.0)
axes[i].grid("y", linestyle="--", linewidth=0.75, alpha=0.5, color="lightgrey")
axes[i].set_xticklabels(axes[i].get_xticklabels(), rotation=0, ha="center", fontsize=10)
# If first subplot, set the ylabel
if i == 0:
axes[i].set_ylabel("MCC Score", fontsize=12)
# Remove the legend for the first subplot
axes[i].legend().set_visible(False)
else:
axes[i].set_ylabel("")
axes[i].legend(
loc='upper right', title="Method", fontsize=10, title_fontsize=12,
frameon=False,
)
# Add a text to clarify the various annotations of the plot
# Include the p-value threshold in the text
# Include the MCC interpretation color legend
fig.text(
0.9,
.97,
f"Black circles represent the MCC values at p-value threshold of {pthr}\nMCC interpretation thresholds are indicated by dotted lines",
ha="right",
va="center",
fontsize=12,
fontstyle="italic",
color="gray",
)
plt.tight_layout()
plots.finalize_plot(
fig,
show=True,
save=save_to_folder,
filename=f"{simID}_PepIDBenchmark_MCC_Barplot",
filepath=figure_path,
formats=figure_formats,
transparent=transparent_bg,
dpi=figure_dpi,
)
Figure saved to: ./figures/Sim1//pdf/Sim1_PepIDBenchmark_MCC_Barplot.pdf
Simulation 2: Missingness Level Impact¶
Objective: Quantify how increasing proportions of missing data affect peptide identification performance after imputation.
Input Data: ./data/Sim2/2_Pro{rate}_Pep{rate}_imputed_{method}_ResultData.feather
Experimental Design:
- Protein missingness: 0%, 20%, 40%, 60%, 80%
- Peptide missingness: 0%, 20%, 40%, 60%, 80%
- Full factorial: 5 × 5 = 25 combinations per method
Metrics Generated:
- AUROC heatmaps (method × missingness)
- MCC heatmaps (mean, at p-threshold, maximum)
Key Question: At what missingness level does performance critically degrade?
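One way to read this off the tables produced below is to express each missingness cell relative to the complete-data baseline. A short sketch of that comparison; it assumes the auc_df frame (Method, ProteinMissingness, PeptideMissingness, AUC) constructed in the AUROC heatmap cell further below:
# Sketch: percent AUROC drop per missingness cell vs. the complete-data (0, 0) baseline
def relative_auroc_drop(auc_df):
    baseline = (
        auc_df[(auc_df["ProteinMissingness"] == 0) & (auc_df["PeptideMissingness"] == 0)]
        .set_index("Method")["AUC"]
    )
    out = auc_df.copy()
    out["AUC_drop_pct"] = 100 * (1 - out["AUC"] / out["Method"].map(baseline))
    return out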
Data Processing and Metric Calculation¶
stTime = utils.getTime()
simID = "Sim2"
# Set up a path for the simulation
output_path, figure_path = setup_simulation_paths( simID )
methods = list(method_palette.keys())
missRatePros = [0, 0.2, 0.4, 0.6, 0.8]
missRatePeps = [0, 0.2, 0.4, 0.6, 0.8]
missRate_mapper = {
0: '0%',
0.2: '20%',
0.4: '40%',
0.6: '60%',
0.8: '80%',
}
print("PEPTIDE IDENTIFICATION BENCHMARK ANALYSIS")
print("-" * 52)
print(f"Simulation ID: {simID}")
print(f"Methods: {', '.join(methods)} ({len(methods)} total)")
print(f"Protein Missing Rates: {', '.join(map(str, missRatePros))}")
print(f"Peptide Missing Rates: {', '.join(map(str, missRatePeps))}")
print(f"Output Path: {output_path}")
# Progress tracking
total_combinations = len(methods) * len(missRatePros) * len(missRatePeps)
current_combination = 0
files_found = 0
files_missing = 0
print(f"Total combinations to process: {total_combinations}")
print("-" * 52)
display("Calculating ID Benchmarks:")
combined_results = []
for i in missRatePros:
for j in missRatePeps:
for cur_method in methods:
current_combination += 1
# Load the result data
result_file = f"{output_path}2_Pro{i}_Pep{j}_imputed_{cur_method}_ResultData.feather"
if os.path.exists(result_file):
res_df = pd.read_feather(result_file)
files_found += 1
if verbose:
progress = f"[{current_combination:2d}/{total_combinations}]"
print(f"{progress} ✓ {cur_method:12} | ProMissing:{i:.1f} | PepMissing:{j:.1f}")
else:
files_missing += 1
if verbose:
progress = f"[{current_combination:2d}/{total_combinations}]"
print(f"{progress} ✗ {cur_method:12} | ProMissing:{i:.1f} | PepMissing:{j:.1f} (MISSING)")
continue
# Handle COPF-specific column renaming
if cur_method == "COPF":
res_df = res_df.rename(columns={
"proteoform_score_pval": "adj_pval",
'protein_id': "Protein"
})
# Create metric data
metric_data = utils.create_metric_data(
res_df,
pvalue_thresholds=thresholds,
label_col='pertPeptide',
pvalue_col='adj_pval'
)
metric_data['Method'] = cur_method
metric_data['ProteinMissingness'] = i
metric_data['PeptideMissingness'] = j
combined_results.append(metric_data)
# Combine and process data
idBenchmarkData = pd.concat(combined_results, ignore_index=True)
# Save the processed data
output_file = f"{output_path}4_{simID}_PeptideIdentification_PerformanceData.feather"
idBenchmarkData.to_feather(output_file)
idBenchmarkData.to_csv(output_file.replace('.feather', '.csv'), index=False)
print(f"\nDATA PREVIEW:")
display(idBenchmarkData.tail(5))
print(f"\nRESULTS SUMMARY:")
print("=" * 52)
execTime = utils.prettyTimer(utils.getTime() - stTime)
print(f"Total execution time: {execTime}")
print(f"Files processed successfully: {files_found}")
print(f"Files missing/skipped: {files_missing}")
print(f"Processing success rate: {files_found/(files_found+files_missing)*100:.1f}%")
print("-" * 52)
print(f"Final dataset shape: {idBenchmarkData.shape}")
protein_levels = sorted(idBenchmarkData['ProteinMissingness'].unique())
peptide_levels = sorted(idBenchmarkData['PeptideMissingness'].unique())
methods_list = sorted(idBenchmarkData['Method'].unique())
print(f"Unique protein missingness levels: {len(protein_levels)} -> {protein_levels}")
print(f"Unique peptide missingness levels: {len(peptide_levels)} -> {peptide_levels}")
print(f"Methods analyzed: {idBenchmarkData['Method'].nunique()} -> {', '.join(methods_list)}")
print(f"Threshold levels included: {idBenchmarkData['threshold'].nunique()} (min={idBenchmarkData['threshold'].min()}, max={idBenchmarkData['threshold'].max()})")
print(f"Data saved to: {output_file}")
print("=" * 52)
PEPTIDE IDENTIFICATION BENCHMARK ANALYSIS ---------------------------------------------------- Simulation ID: Sim2 Methods: COPF, PeCorA, ProteoForge (3 total) Protein Missing Rates: 0, 0.2, 0.4, 0.6, 0.8 Peptide Missing Rates: 0, 0.2, 0.4, 0.6, 0.8 Output Path: ./data/Sim2/ Total combinations to process: 75 ----------------------------------------------------
'Calculating ID Benchmarks:'
[ 1/75] ✓ COPF | ProMissing:0.0 | PepMissing:0.0 [ 2/75] ✓ PeCorA | ProMissing:0.0 | PepMissing:0.0 [ 3/75] ✓ ProteoForge | ProMissing:0.0 | PepMissing:0.0 [ 4/75] ✓ COPF | ProMissing:0.0 | PepMissing:0.2 [ 5/75] ✓ PeCorA | ProMissing:0.0 | PepMissing:0.2 [ 6/75] ✓ ProteoForge | ProMissing:0.0 | PepMissing:0.2 [ 7/75] ✓ COPF | ProMissing:0.0 | PepMissing:0.4 [ 8/75] ✓ PeCorA | ProMissing:0.0 | PepMissing:0.4 [ 9/75] ✓ ProteoForge | ProMissing:0.0 | PepMissing:0.4 [10/75] ✓ COPF | ProMissing:0.0 | PepMissing:0.6 [11/75] ✓ PeCorA | ProMissing:0.0 | PepMissing:0.6 [12/75] ✓ ProteoForge | ProMissing:0.0 | PepMissing:0.6 [13/75] ✓ COPF | ProMissing:0.0 | PepMissing:0.8 [14/75] ✓ PeCorA | ProMissing:0.0 | PepMissing:0.8 [15/75] ✓ ProteoForge | ProMissing:0.0 | PepMissing:0.8 [16/75] ✓ COPF | ProMissing:0.2 | PepMissing:0.0 [17/75] ✓ PeCorA | ProMissing:0.2 | PepMissing:0.0 [18/75] ✓ ProteoForge | ProMissing:0.2 | PepMissing:0.0 [19/75] ✓ COPF | ProMissing:0.2 | PepMissing:0.2 [20/75] ✓ PeCorA | ProMissing:0.2 | PepMissing:0.2 [21/75] ✓ ProteoForge | ProMissing:0.2 | PepMissing:0.2 [22/75] ✓ COPF | ProMissing:0.2 | PepMissing:0.4 [23/75] ✓ PeCorA | ProMissing:0.2 | PepMissing:0.4 [24/75] ✓ ProteoForge | ProMissing:0.2 | PepMissing:0.4 [25/75] ✓ COPF | ProMissing:0.2 | PepMissing:0.6 [26/75] ✓ PeCorA | ProMissing:0.2 | PepMissing:0.6 [27/75] ✓ ProteoForge | ProMissing:0.2 | PepMissing:0.6 [28/75] ✓ COPF | ProMissing:0.2 | PepMissing:0.8 [29/75] ✓ PeCorA | ProMissing:0.2 | PepMissing:0.8 [30/75] ✓ ProteoForge | ProMissing:0.2 | PepMissing:0.8 [31/75] ✓ COPF | ProMissing:0.4 | PepMissing:0.0 [32/75] ✓ PeCorA | ProMissing:0.4 | PepMissing:0.0 [33/75] ✓ ProteoForge | ProMissing:0.4 | PepMissing:0.0 [34/75] ✓ COPF | ProMissing:0.4 | PepMissing:0.2 [35/75] ✓ PeCorA | ProMissing:0.4 | PepMissing:0.2 [36/75] ✓ ProteoForge | ProMissing:0.4 | PepMissing:0.2 [37/75] ✓ COPF | ProMissing:0.4 | PepMissing:0.4 [38/75] ✓ PeCorA | ProMissing:0.4 | PepMissing:0.4 [39/75] ✓ ProteoForge | ProMissing:0.4 | PepMissing:0.4 [40/75] ✓ COPF | ProMissing:0.4 | PepMissing:0.6 [41/75] ✓ PeCorA | ProMissing:0.4 | PepMissing:0.6 [42/75] ✓ ProteoForge | ProMissing:0.4 | PepMissing:0.6 [43/75] ✓ COPF | ProMissing:0.4 | PepMissing:0.8 [44/75] ✓ PeCorA | ProMissing:0.4 | PepMissing:0.8 [45/75] ✓ ProteoForge | ProMissing:0.4 | PepMissing:0.8 [46/75] ✓ COPF | ProMissing:0.6 | PepMissing:0.0 [47/75] ✓ PeCorA | ProMissing:0.6 | PepMissing:0.0 [48/75] ✓ ProteoForge | ProMissing:0.6 | PepMissing:0.0 [49/75] ✓ COPF | ProMissing:0.6 | PepMissing:0.2 [50/75] ✓ PeCorA | ProMissing:0.6 | PepMissing:0.2 [51/75] ✓ ProteoForge | ProMissing:0.6 | PepMissing:0.2 [52/75] ✓ COPF | ProMissing:0.6 | PepMissing:0.4 [53/75] ✓ PeCorA | ProMissing:0.6 | PepMissing:0.4 [54/75] ✓ ProteoForge | ProMissing:0.6 | PepMissing:0.4 [55/75] ✓ COPF | ProMissing:0.6 | PepMissing:0.6 [56/75] ✓ PeCorA | ProMissing:0.6 | PepMissing:0.6 [57/75] ✓ ProteoForge | ProMissing:0.6 | PepMissing:0.6 [58/75] ✓ COPF | ProMissing:0.6 | PepMissing:0.8 [59/75] ✓ PeCorA | ProMissing:0.6 | PepMissing:0.8 [60/75] ✓ ProteoForge | ProMissing:0.6 | PepMissing:0.8 [61/75] ✓ COPF | ProMissing:0.8 | PepMissing:0.0 [62/75] ✓ PeCorA | ProMissing:0.8 | PepMissing:0.0 [63/75] ✓ ProteoForge | ProMissing:0.8 | PepMissing:0.0 [64/75] ✓ COPF | ProMissing:0.8 | PepMissing:0.2 [65/75] ✓ PeCorA | ProMissing:0.8 | PepMissing:0.2 [66/75] ✓ ProteoForge | ProMissing:0.8 | PepMissing:0.2 [67/75] ✓ COPF | ProMissing:0.8 | PepMissing:0.4 [68/75] ✓ PeCorA | ProMissing:0.8 | PepMissing:0.4 [69/75] ✓ ProteoForge 
| ProMissing:0.8 | PepMissing:0.4 [70/75] ✓ COPF | ProMissing:0.8 | PepMissing:0.6 [71/75] ✓ PeCorA | ProMissing:0.8 | PepMissing:0.6 [72/75] ✓ ProteoForge | ProMissing:0.8 | PepMissing:0.6 [73/75] ✓ COPF | ProMissing:0.8 | PepMissing:0.8 [74/75] ✓ PeCorA | ProMissing:0.8 | PepMissing:0.8 [75/75] ✓ ProteoForge | ProMissing:0.8 | PepMissing:0.8 DATA PREVIEW:
| | TP | FP | TN | FN | TPR | FPR | FDR | MCC | Precision | Recall | F1 | threshold | Method | ProteinMissingness | PeptideMissingness |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1870 | 1090 | 964 | 3781 | 549 | 0.6650 | 0.2032 | 0.4693 | 0.4319 | 0.5307 | 0.6650 | 0.5903 | 0.6000 | ProteoForge | 0.8000 | 0.8000 |
| 1871 | 1101 | 988 | 3757 | 538 | 0.6718 | 0.2082 | 0.4730 | 0.4316 | 0.5270 | 0.6718 | 0.5907 | 0.7000 | ProteoForge | 0.8000 | 0.8000 |
| 1872 | 1104 | 1017 | 3728 | 535 | 0.6736 | 0.2143 | 0.4795 | 0.4259 | 0.5205 | 0.6736 | 0.5872 | 0.8000 | ProteoForge | 0.8000 | 0.8000 |
| 1873 | 1111 | 1038 | 3707 | 528 | 0.6779 | 0.2188 | 0.4830 | 0.4244 | 0.5170 | 0.6779 | 0.5866 | 0.9000 | ProteoForge | 0.8000 | 0.8000 |
| 1874 | 1639 | 4745 | 0 | 0 | 1.0000 | 1.0000 | 0.7433 | 0.0000 | 0.2567 | 1.0000 | 0.4086 | 1.0000 | ProteoForge | 0.8000 | 0.8000 |
RESULTS SUMMARY: ==================================================== Total execution time: 00h:00m:09s Files processed successfully: 75 Files missing/skipped: 0 Processing success rate: 100.0% ---------------------------------------------------- Final dataset shape: (1875, 15) Unique protein missingness levels: 5 -> [np.float64(0.0), np.float64(0.2), np.float64(0.4), np.float64(0.6), np.float64(0.8)] Unique peptide missingness levels: 5 -> [np.float64(0.0), np.float64(0.2), np.float64(0.4), np.float64(0.6), np.float64(0.8)] Methods analyzed: 3 -> COPF, PeCorA, ProteoForge Threshold levels included: 25 (min=0.0, max=1.0) Data saved to: ./data/Sim2/4_Sim2_PeptideIdentification_PerformanceData.feather ====================================================
AUROC Heatmaps: Discrimination Across Missingness¶
Heatmaps show AUROC values for each method across the protein × peptide missingness grid. Higher values (darker blue) indicate better discrimination. The (0,0) corner represents complete data baseline.
# Comparing AUC Values across different missingness levels
# Calculate AUC for ROC and PR curves
auc_results = []
for (method, pro_miss, pep_miss), group in idBenchmarkData.groupby(
['Method', 'ProteinMissingness', 'PeptideMissingness']
):
# ROC AUC
group = utils.complete_curve_data(group, 'ROC', 'FPR', 'TPR')
roc_data = group.sort_values('FPR')
roc_auc = np.trapezoid(roc_data['TPR'], roc_data['FPR'])
auc_results.append({
'Method': method,
'ProteinMissingness': pro_miss,
'PeptideMissingness': pep_miss,
'AUC': roc_auc,
})
auc_df = pd.DataFrame(auc_results)
print("\nAUC Data Preview:")
display(auc_df.head())
# Initialize Figure for 3 Methods with dedicated colorbar column
fig, axes = plt.subplots(
nrows=1, ncols=4, figsize=(12, 4),
gridspec_kw={
"wspace": 0.05, "hspace": 0.1,
"width_ratios": [0.9, 0.9, 0.9, 0.1] # First 3 columns same size, 4th for colorbar
},
)
vmin = auc_df["AUC"].min()
vmax = auc_df["AUC"].max()
# Create heatmaps for the first 3 methods
for i, cur_method in enumerate(methods):
plot_data = auc_df[auc_df["Method"] == cur_method].pivot_table(
index="PeptideMissingness",
columns="ProteinMissingness",
values="AUC",
aggfunc="mean"
)
# Make sure 0,0 is in the bottom left corner
plot_data = plot_data.iloc[::-1]
# Only show colorbar on the last method heatmap
cb = (i == len(methods) - 1)
# If this is the last method, create colorbar in the 4th column
if cb:
cbar_ax = axes[3]
else:
cbar_ax = None
sns.heatmap(
plot_data,
vmin=vmin,
vmax=vmax,
cmap="Blues",
annot=True,
fmt=".2f",
ax=axes[i],
cbar=cb,
cbar_ax=cbar_ax,
cbar_kws={"label": "AUROC"},
square=True,
# Annotation size
annot_kws={"size": 12},
)
if i == 0:
axes[i].set_ylabel("Peptide Missingness Rate", fontsize=14, fontweight="bold")
else:
axes[i].set_ylabel("")
# Remove yticklabels
axes[i].set_yticklabels([])
if i == 1:
axes[i].set_xlabel("Protein Missingness Rate", fontsize=14, fontweight="bold")
else:
axes[i].set_xlabel("")
axes[i].set_title(f"{cur_method}", fontsize=14, fontweight="bold")
# Hide the 4th axis if it wasn't used for colorbar (this handles the case where we have < 3 methods)
if len(methods) < 3:
axes[3].set_visible(False)
plt.tight_layout()
plt.suptitle("Peptide Identification Benchmark with AUROC", fontsize=16, fontweight="bold")
plt.subplots_adjust(top=0.85)
plots.finalize_plot(
fig,
show=True,
save=save_to_folder,
filename=f"{simID}_PepIDBenchmark_AUROC_Heatmap",
filepath=figure_path,
formats=figure_formats,
transparent=transparent_bg,
dpi=figure_dpi,
)
AUC Data Preview:
| | Method | ProteinMissingness | PeptideMissingness | AUC |
|---|---|---|---|---|
| 0 | COPF | 0.0000 | 0.0000 | 0.5901 |
| 1 | COPF | 0.0000 | 0.2000 | 0.5799 |
| 2 | COPF | 0.0000 | 0.4000 | 0.5783 |
| 3 | COPF | 0.0000 | 0.6000 | 0.5782 |
| 4 | COPF | 0.0000 | 0.8000 | 0.6050 |
Figure saved to: ./figures/Sim2//pdf/Sim2_PepIDBenchmark_AUROC_Heatmap.pdf
Mean MCC Heatmaps: Average Performance¶
Shows mean MCC averaged across all p-value thresholds. Provides a threshold-agnostic view of overall method performance at each missingness combination.
# Mean MCC Scores
# Heatmap of the MCC values for the different methods
# (Protein Missingness and Peptide Missingness) (X: Protein Missingness, Y: Peptide Missingness)
# Initialize Figure for 3 Methods
fig, axes = plt.subplots(
nrows=1, ncols=4, figsize=(12, 4),
gridspec_kw={
"wspace": 0.05, "hspace": 0.1,
"width_ratios": [0.9, 0.9, 0.9, 0.1] # First 3 columns same size, 4th for colorbar
},
)
vmin = idBenchmarkData["MCC"].min()
vmax = idBenchmarkData["MCC"].max()
# Create heatmaps for the first 3 methods
for i, cur_method in enumerate(methods):
plot_data = idBenchmarkData[idBenchmarkData["Method"] == cur_method].pivot_table(
index="PeptideMissingness",
columns="ProteinMissingness",
values="MCC",
aggfunc="mean"
)
# Make sure 0,0 is in the bottom left corner
plot_data = plot_data.iloc[::-1]
# Only show colorbar on the last method heatmap
cb = (i == len(methods) - 1)
# If this is the last method, create colorbar in the 4th column
if cb:
cbar_ax = axes[3]
else:
cbar_ax = None
sns.heatmap(
plot_data,
vmin=vmin,
vmax=vmax,
cmap="Purples",
annot=True,
fmt=".2f",
ax=axes[i],
cbar=cb,
cbar_ax=cbar_ax,
cbar_kws={"label": "Mean MCC"},
square=True,
# Annotation size
annot_kws={"size": 12},
)
if i == 0:
axes[i].set_ylabel("Peptide Missingness Rate", fontsize=14, fontweight="bold")
else:
axes[i].set_ylabel("")
# Remove yticklabels
axes[i].set_yticklabels([])
if i == 1:
axes[i].set_xlabel("Protein Missingness Rate", fontsize=14, fontweight="bold")
else:
axes[i].set_xlabel("")
axes[i].set_title(f"{cur_method}", fontsize=14, fontweight="bold")
# Hide the 4th axis if it wasn't used for colorbar (this handles the case where we have < 3 methods)
if len(methods) < 3:
axes[3].set_visible(False)
plt.tight_layout()
plt.suptitle("Peptide Identification Benchmark with MCC (Mean)", fontsize=16, fontweight="bold")
plt.subplots_adjust(top=0.85)
plots.finalize_plot(
fig,
show=True,
save=save_to_folder,
filename=f"{simID}_PepIDBenchmark_MCC_mean_Heatmap",
filepath=figure_path,
formats=figure_formats,
transparent=transparent_bg,
dpi=figure_dpi,
)
Figure saved to: ./figures/Sim2//pdf/Sim2_PepIDBenchmark_MCC_mean_Heatmap.pdf
MCC at P-value Threshold: Practical Performance¶
Shows MCC specifically at the commonly used significance threshold (p < 10⁻³). This represents the expected performance when using standard statistical cutoffs.
# MCC Score at pThr
# Heatmap of the MCC values for the different methods
# (Protein Missingness and Peptide Missingness) (X: Protein Missingness, Y: Peptide Missingness)
# Initialize Figure for 3 Methods
fig, axes = plt.subplots(
nrows=1, ncols=4, figsize=(12, 4),
gridspec_kw={
"wspace": 0.05, "hspace": 0.1,
"width_ratios": [0.9, 0.9, 0.9, 0.1] # First 3 columns same size, 4th for colorbar
},
)
vmin = idBenchmarkData["MCC"].min()
vmax = idBenchmarkData["MCC"].max()
# Create heatmaps for the first 3 methods
for i, cur_method in enumerate(methods):
plot_data = idBenchmarkData[
(idBenchmarkData["Method"] == cur_method) &
(idBenchmarkData["threshold"] == pthr)
].pivot_table(
index="PeptideMissingness",
columns="ProteinMissingness",
values="MCC",
aggfunc="mean"
)
# Make sure 0,0 is in the bottom left corner
plot_data = plot_data.iloc[::-1]
# Only show colorbar on the last method heatmap
cb = (i == len(methods) - 1)
# If this is the last method, create colorbar in the 4th column
if cb:
cbar_ax = axes[3]
else:
cbar_ax = None
sns.heatmap(
plot_data,
vmin=vmin,
vmax=vmax,
cmap="Purples",
annot=True,
fmt=".2f",
ax=axes[i],
cbar=cb,
cbar_ax=cbar_ax,
cbar_kws={"label": f"MCC at p-value {pthr}"},
square=True,
# Annotation size
annot_kws={"size": 12},
)
if i == 0:
axes[i].set_ylabel("Peptide Missingness Rate", fontsize=14, fontweight="bold")
else:
axes[i].set_ylabel("")
# Remove yticklabels
axes[i].set_yticklabels([])
if i == 1:
axes[i].set_xlabel("Protein Missingness Rate", fontsize=14, fontweight="bold")
else:
axes[i].set_xlabel("")
axes[i].set_title(f"{cur_method}", fontsize=14, fontweight="bold")
# Hide the 4th axis if it wasn't used for colorbar (this handles the case where we have < 3 methods)
if len(methods) < 3:
axes[3].set_visible(False)
plt.tight_layout()
plt.suptitle(f"Peptide Identification Benchmark with MCC at p-value {pthr}", fontsize=16, fontweight="bold")
plt.subplots_adjust(top=0.85)
plots.finalize_plot(
fig,
show=True,
save=save_to_folder,
filename=f"{simID}_PepIDBenchmark_MCC_pThr_Heatmap",
filepath=figure_path,
formats=figure_formats,
transparent=transparent_bg,
dpi=figure_dpi,
)
Figure saved to: ./figures/Sim2//pdf/Sim2_PepIDBenchmark_MCC_pThr_Heatmap.pdf
Maximum MCC Heatmaps: Best Achievable Performance¶
Shows the maximum MCC achievable across all thresholds—the theoretical upper bound of performance if the optimal threshold were known.
# Maximum MCC Scores
# Heatmap of the MCC values for the different methods
# (Protein Missingness and Peptide Missingness) (X: Protein Missingness, Y: Peptide Missingness)
# Initialize Figure for 3 Methods with dedicated colorbar column
fig, axes = plt.subplots(
nrows=1, ncols=4, figsize=(12, 4),
gridspec_kw={
"wspace": 0.05, "hspace": 0.1,
"width_ratios": [0.9, 0.9, 0.9, 0.1] # First 3 columns same size, 4th for colorbar
},
)
vmin = idBenchmarkData["MCC"].min()
vmax = idBenchmarkData["MCC"].max()
# Create heatmaps for the first 3 methods
for i, cur_method in enumerate(methods):
plot_data = idBenchmarkData[idBenchmarkData["Method"] == cur_method].pivot_table(
index="PeptideMissingness",
columns="ProteinMissingness",
values="MCC",
aggfunc="max"
)
# Make sure 0,0 is in the bottom left corner
plot_data = plot_data.iloc[::-1]
# Only show colorbar on the last method heatmap
cb = (i == len(methods) - 1)
# If this is the last method, create colorbar in the 4th column
if cb:
cbar_ax = axes[3]
else:
cbar_ax = None
sns.heatmap(
plot_data,
vmin=vmin,
vmax=vmax,
cmap="Purples",
annot=True,
fmt=".2f",
ax=axes[i],
cbar=cb,
cbar_ax=cbar_ax,
cbar_kws={"label": "Max MCC"},
square=True,
# Annotation size
annot_kws={"size": 12},
)
if i == 0:
axes[i].set_ylabel("Peptide Missingness Rate", fontsize=14, fontweight="bold")
else:
axes[i].set_ylabel("")
# Remove yticklabels
axes[i].set_yticklabels([])
if i == 1:
axes[i].set_xlabel("Protein Missingness Rate", fontsize=14, fontweight="bold")
else:
axes[i].set_xlabel("")
axes[i].set_title(f"{cur_method}", fontsize=14, fontweight="bold")
# Hide the 4th axis if it wasn't used for colorbar (this handles the case where we have < 3 methods)
if len(methods) < 3:
axes[3].set_visible(False)
plt.tight_layout()
plt.suptitle("Peptide Identification Benchmark with MCC (Max)", fontsize=16, fontweight="bold")
plt.subplots_adjust(top=0.85)
plots.finalize_plot(
fig,
show=True,
save=save_to_folder,
filename=f"{simID}_PepIDBenchmark_MCC_max_Heatmap",
filepath=figure_path,
formats=figure_formats,
transparent=transparent_bg,
dpi=figure_dpi,
)
Figure saved to: ./figures/Sim2//pdf/Sim2_PepIDBenchmark_MCC_max_Heatmap.pdf
Simulation 3: Perturbation Magnitude Sensitivity¶
Objective: Determine the minimum perturbation magnitude required for reliable peptide identification across methods.
Input Data: ./data/Sim3/2_{low}_{high}_{method}_ResultData.feather
Magnitude Ranges (log2 fold-change):

| Range | Interpretation |
|-------|----------------|
| 0.10–0.25 | Subtle |
| 0.25–0.50 | Small |
| 0.50–0.75 | Moderate |
| 0.75–1.00 | Medium |
| 1.00–1.25 | Substantial |
| 1.25–1.50 | Large |
| 1.50–1.75 | Strong |
| 1.75–2.00 | Very strong |
Metrics Generated:
- MCC line plots across magnitude ranges
- AUROC heatmap (method × magnitude)
Key Question: What is the practical detection threshold for each method?
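One simple way to extract a practical detection threshold from the data generated below: for each method, find the first perturbation range whose best achievable MCC crosses the "Moderate" (0.5) mark. A sketch, assuming the idBenchmarkData frame built in the next cell:
# Sketch: earliest perturbation range where a method reaches MCC >= 0.5 at any p-value threshold
def detection_range(df, mcc_cutoff=0.5):
    best = df.groupby(["Method", "PerturbationRange"])["MCC"].max().reset_index()
    reached = best[best["MCC"] >= mcc_cutoff]
    # PerturbationRange strings such as "0.10-0.25" are fixed-width, so lexicographic sort preserves order
    return reached.sort_values("PerturbationRange").groupby("Method").first()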
Data Processing and Metric Calculation¶
stTime = utils.getTime()
simID = "Sim3"
# Set up a path for the simulation
output_path, figure_path = setup_simulation_paths( simID )
methods = list(method_palette.keys())
perturbationRanges = [
(0.1, 0.25),
(0.25, 0.50),
(0.50, 0.75),
(0.75, 1.0),
(1.0, 1.25),
(1.25, 1.50),
(1.50, 1.75),
(1.75, 2.0),
]
perturbation_mapper = {
(0.1, 0.25): '0.1-0.25',
(0.25, 0.50): '0.25-0.50',
(0.50, 0.75): '0.50-0.75',
(0.75, 1.0): '0.75-1.0',
(1.0, 1.25): '1.0-1.25',
(1.25, 1.50): '1.25-1.50',
(1.50, 1.75): '1.50-1.75',
(1.75, 2.0): '1.75-2.0',
}
print("PEPTIDE IDENTIFICATION BENCHMARK ANALYSIS")
print("-" * 52)
print(f"Simulation ID: {simID}")
print(f"Methods: {', '.join(methods)} ({len(methods)} total)")
print(f"Perturbation Ranges: {', '.join([f'{low}-{high}' for low, high in perturbationRanges])}")
print(f"Output Path: {output_path}")
# Progress tracking
total_combinations = len(methods) * len(perturbationRanges)
current_combination = 0
files_found = 0
files_missing = 0
print(f"Total combinations to process: {total_combinations}")
print("-" * 52)
display("Calculating ID Benchmarks:")
combined_results = []
for low, high in perturbationRanges:
for cur_method in methods:
current_combination += 1
# Load the result data
result_file = f"{output_path}2_{low}_{high}_{cur_method}_ResultData.feather"
if os.path.exists(result_file):
res_df = pd.read_feather(result_file)
files_found += 1
if verbose:
progress = f"[{current_combination:2d}/{total_combinations}]"
print(f"{progress} ✓ {cur_method:12} | Perturbation: {low:.2f}-{high:.2f}")
else:
files_missing += 1
if verbose:
progress = f"[{current_combination:2d}/{total_combinations}]"
print(f"{progress} ✗ {cur_method:12} | Perturbation: {low:.2f}-{high:.2f} (MISSING)")
continue
# Handle COPF-specific column renaming
if cur_method == "COPF":
res_df = res_df.rename(columns={
"proteoform_score_pval": "adj_pval",
'protein_id': "Protein"
})
# Create metric data
metric_data = utils.create_metric_data(
res_df,
pvalue_thresholds=thresholds,
label_col='pertPeptide',
pvalue_col='adj_pval'
)
metric_data['Method'] = cur_method
metric_data['PerturbationRange'] = f"{low:.2f}-{high:.2f}"
combined_results.append(metric_data)
# Combine and process data
idBenchmarkData = pd.concat(combined_results, ignore_index=True)
# Save the processed data
output_file = f"{output_path}4_{simID}_PeptideIdentification_PerformanceData.feather"
idBenchmarkData.to_feather(output_file)
idBenchmarkData.to_csv(output_file.replace('.feather', '.csv'), index=False)
print(f"\nDATA PREVIEW:")
display(idBenchmarkData.tail(5))
print(f"\nRESULTS SUMMARY:")
print("=" * 52)
execTime = utils.prettyTimer(utils.getTime() - stTime)
print(f"Total execution time: {execTime}")
print(f"Files processed successfully: {files_found}")
print(f"Files missing/skipped: {files_missing}")
print(f"Processing success rate: {files_found/(files_found+files_missing)*100:.1f}%")
print("-" * 52)
print(f"Final dataset shape: {idBenchmarkData.shape}")
pert_ranges = sorted(idBenchmarkData['PerturbationRange'].unique())
methods_list = sorted(idBenchmarkData['Method'].unique())
print(f"Unique perturbation ranges: {len(pert_ranges)} -> {', '.join(pert_ranges)}")
print(f"Methods analyzed: {idBenchmarkData['Method'].nunique()} -> {', '.join(methods_list)}")
print(f"Threshold levels included: {idBenchmarkData['threshold'].nunique()} (min={idBenchmarkData['threshold'].min()}, max={idBenchmarkData['threshold'].max()})")
print(f"Data saved to: {output_file}")
print("=" * 52)
PEPTIDE IDENTIFICATION BENCHMARK ANALYSIS ---------------------------------------------------- Simulation ID: Sim3 Methods: COPF, PeCorA, ProteoForge (3 total) Perturbation Ranges: 0.1-0.25, 0.25-0.5, 0.5-0.75, 0.75-1.0, 1.0-1.25, 1.25-1.5, 1.5-1.75, 1.75-2.0 Output Path: ./data/Sim3/ Total combinations to process: 24 ----------------------------------------------------
'Calculating ID Benchmarks:'
[ 1/24] ✓ COPF | Perturbation: 0.10-0.25 [ 2/24] ✓ PeCorA | Perturbation: 0.10-0.25 [ 3/24] ✓ ProteoForge | Perturbation: 0.10-0.25 [ 4/24] ✓ COPF | Perturbation: 0.25-0.50 [ 5/24] ✓ PeCorA | Perturbation: 0.25-0.50 [ 6/24] ✓ ProteoForge | Perturbation: 0.25-0.50 [ 7/24] ✓ COPF | Perturbation: 0.50-0.75 [ 8/24] ✓ PeCorA | Perturbation: 0.50-0.75 [ 9/24] ✓ ProteoForge | Perturbation: 0.50-0.75 [10/24] ✓ COPF | Perturbation: 0.75-1.00 [11/24] ✓ PeCorA | Perturbation: 0.75-1.00 [12/24] ✓ ProteoForge | Perturbation: 0.75-1.00 [13/24] ✓ COPF | Perturbation: 1.00-1.25 [14/24] ✓ PeCorA | Perturbation: 1.00-1.25 [15/24] ✓ ProteoForge | Perturbation: 1.00-1.25 [16/24] ✓ COPF | Perturbation: 1.25-1.50 [17/24] ✓ PeCorA | Perturbation: 1.25-1.50 [18/24] ✓ ProteoForge | Perturbation: 1.25-1.50 [19/24] ✓ COPF | Perturbation: 1.50-1.75 [20/24] ✓ PeCorA | Perturbation: 1.50-1.75 [21/24] ✓ ProteoForge | Perturbation: 1.50-1.75 [22/24] ✓ COPF | Perturbation: 1.75-2.00 [23/24] ✓ PeCorA | Perturbation: 1.75-2.00 [24/24] ✓ ProteoForge | Perturbation: 1.75-2.00 DATA PREVIEW:
| | TP | FP | TN | FN | TPR | FPR | FDR | MCC | Precision | Recall | F1 | threshold | Method | PerturbationRange |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 595 | 872 | 443 | 4367 | 1 | 0.9989 | 0.0921 | 0.3369 | 0.7753 | 0.6631 | 0.9989 | 0.7971 | 0.6000 | ProteoForge | 1.75-2.00 |
| 596 | 873 | 456 | 4354 | 0 | 1.0000 | 0.0948 | 0.3431 | 0.7711 | 0.6569 | 1.0000 | 0.7929 | 0.7000 | ProteoForge | 1.75-2.00 |
| 597 | 873 | 482 | 4328 | 0 | 1.0000 | 0.1002 | 0.3557 | 0.7614 | 0.6443 | 1.0000 | 0.7837 | 0.8000 | ProteoForge | 1.75-2.00 |
| 598 | 873 | 504 | 4306 | 0 | 1.0000 | 0.1048 | 0.3660 | 0.7534 | 0.6340 | 1.0000 | 0.7760 | 0.9000 | ProteoForge | 1.75-2.00 |
| 599 | 873 | 4810 | 0 | 0 | 1.0000 | 1.0000 | 0.8464 | 0.0000 | 0.1536 | 1.0000 | 0.2663 | 1.0000 | ProteoForge | 1.75-2.00 |
RESULTS SUMMARY: ==================================================== Total execution time: 00h:00m:03s Files processed successfully: 24 Files missing/skipped: 0 Processing success rate: 100.0% ---------------------------------------------------- Final dataset shape: (600, 14) Unique perturbation ranges: 8 -> 0.10-0.25, 0.25-0.50, 0.50-0.75, 0.75-1.00, 1.00-1.25, 1.25-1.50, 1.50-1.75, 1.75-2.00 Methods analyzed: 3 -> COPF, PeCorA, ProteoForge Threshold levels included: 25 (min=0.0, max=1.0) Data saved to: ./data/Sim3/4_Sim3_PeptideIdentification_PerformanceData.feather ====================================================
MCC Sensitivity Curves: Detection Threshold Analysis¶
Line plots show how MCC improves with increasing perturbation magnitude. Shaded bands represent 95% CI. Smaller markers indicate MCC at the specific p-value threshold. MCC interpretation thresholds (labeled boxes) provide performance context.
# Line plot showing MCC scores across perturbation ranges
# line plot marker at the mean MCC and CI as a shaded area
# for each method and data type
fig, ax = plt.subplots(1, 1, figsize=(12, 6))
sns.lineplot(
ax=ax,
data=idBenchmarkData,
x="PerturbationRange",
y="MCC",
hue="Method",
style="Method",
palette=method_palette,
linewidth=3,
markers=method_markers,
markersize=12,
markeredgewidth=0.5,
markeredgecolor="black",
# Estimation and confidence interval
estimator='mean',
errorbar=('ci', 95),
err_style="band",
rasterized=True,
)
ax.set_xlabel("Perturbation Range", fontsize=14, fontweight="bold")
ax.set_ylabel("MCC Score", fontsize=14, fontweight="bold")
ax.set_title("Peptide Identification Benchmark with MCC across Perturbation Ranges", fontsize=16, fontweight="bold")
ax.set_ylim(-0.15, 1.0)
ax.grid("y", linestyle="--", linewidth=0.75, alpha=0.5, color="lightgrey")
# ax.set_xticklabels(ax.get_xticklabels(), rotation=45, ha="right", fontsize=10)
ax.legend(
loc='upper right', title="Method", fontsize=10, title_fontsize=12,
frameon=False,
)
# Draw MCC interpretation thresholds
for thresh, label in mcc_thresholds.items():
ax.axhline(
thresh, color=mcc_colors[label], alpha=1,
linestyle="dotted", linewidth=1.5,
label=label, zorder=0
)
# Use mixed transform: relative x, data y
ax.text(
0.01,
thresh,
label,
color=mcc_colors[label],
ha="left",
va="center", # Center on the line
fontsize=10,
fontweight="bold",
transform=ax.get_yaxis_transform(), # x in axes coords, y in data coords
bbox=dict(boxstyle='round,pad=0.3', facecolor='white', edgecolor=mcc_colors[label], alpha=0.8)
)
# Add smaller scatter points for MCC values at p-value threshold
pthr_data = idBenchmarkData[idBenchmarkData['threshold'] == pthr]
for i, method in enumerate(method_palette.keys()):
method_data = pthr_data[pthr_data['Method'] == method]
ax.scatter(
method_data['PerturbationRange'],
method_data['MCC'],
color=method_palette[method],
s=50,
edgecolor='black',
linewidth=1.25,
marker=method_markers[method],
zorder=10,
alpha=0.9
)
# Add a text to clarify the various annotations of the plot
# Include the p-value threshold in the text
fig.text(
1,
1.05,
f"Black markers represent the MCC values at p-value threshold of {pthr}\nMCC interpretation thresholds are indicated by dotted lines\nShaded areas represent 95% confidence intervals around the mean MCC",
ha="right",
va="center",
fontsize=12,
fontstyle="italic",
color="gray",
)
# Finalize the plot
plt.tight_layout()
plots.finalize_plot(
fig,
show=True,
save=save_to_folder,
filename=f"{simID}_PepIDBenchmark_MCC_Curve_Perturbation",
filepath=figure_path,
formats=figure_formats,
transparent=transparent_bg,
dpi=figure_dpi,
)
Figure saved to: ./figures/Sim3//pdf/Sim3_PepIDBenchmark_MCC_Curve_Perturbation.pdf
AUROC Heatmap: Compact Magnitude Comparison¶
Heatmap provides a compact view of AUROC across all methods and magnitude ranges. Enables quick identification of the magnitude threshold where discrimination becomes reliable.
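The AUROC values are obtained below by trapezoidal integration of the (FPR, TPR) points after sorting by FPR. A minimal toy example of the same computation (values are illustrative only, not from the benchmark):

import numpy as np

# Trapezoidal AUROC over a toy ROC curve sorted by FPR
fpr = np.array([0.0, 0.1, 0.4, 1.0])
tpr = np.array([0.0, 0.6, 0.9, 1.0])
auc = np.trapezoid(tpr, fpr)  # 0.825 for this toy curve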
# Heatmap for AUROC values across perturbation ranges and methods
fig, ax = plt.subplots(1, 1, figsize=(12, 4))
# Calculate AUC for ROC curves
auc_results = []
for (method, pert_range), group in idBenchmarkData.groupby(
['Method', 'PerturbationRange']
):
# ROC AUC
group = utils.complete_curve_data(group, 'ROC', 'FPR', 'TPR')
roc_data = group.sort_values('FPR')
roc_auc = np.trapezoid(roc_data['TPR'], roc_data['FPR'])
auc_results.append({
'Method': method,
'PerturbationRange': pert_range,
'AUC': roc_auc,
})
auc_df = pd.DataFrame(auc_results)
# print("\nAUC Data Preview:")
# display(auc_df.head())
# Pivot to create the matrix: Methods as rows, PerturbationRange as columns
pivot_data = auc_df.pivot(index='Method', columns='PerturbationRange', values='AUC')
# Create heatmap
sns.heatmap(
pivot_data,
annot=True,
fmt='.2f',
cmap='Blues',
cbar_kws={'label': 'AUROC Score', 'shrink': 0.8},
square=False,
linewidths=0.5,
linecolor='white',
annot_kws={'size': 10, 'weight': 'bold'},
ax=ax
)
ax.set_title("Peptide Identification Benchmark with AUROC across Perturbation Ranges", fontsize=16, fontweight="bold")
ax.set_xlabel("Perturbation Range", fontsize=14, fontweight="bold")
ax.set_ylabel("Method", fontsize=14, fontweight="bold")
# Keep tick labels horizontal for better readability
ax.set_xticklabels(ax.get_xticklabels(), rotation=0, ha="center", fontsize=10)
ax.set_yticklabels(ax.get_yticklabels(), rotation=0, ha="right", fontsize=10)
plots.finalize_plot(
fig,
show=True,
save=save_to_folder,
filename=f"{simID}_PepIDBenchmark_AUC_Heatmap_Compact",
filepath=figure_path,
formats=figure_formats,
transparent=transparent_bg,
dpi=figure_dpi,
)
Figure saved to: ./figures/Sim3//pdf/Sim3_PepIDBenchmark_AUC_Heatmap_Compact.pdf
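utils.complete_curve_data above comes from the project's src utilities and is not shown in this notebook. As a rough intuition, for ROC curves it presumably anchors each curve at the (0, 0) and (1, 1) endpoints so the trapezoidal integral spans the full FPR range; a hypothetical stand-in under that assumption (not the project's actual implementation):

import pandas as pd

def complete_roc_curve(df, xcol="FPR", ycol="TPR"):
    """Hypothetical stand-in for utils.complete_curve_data (ROC case):
    add the (0, 0) and (1, 1) anchor points if missing, then sort by
    the x column so the trapezoidal integral is not truncated."""
    anchors = pd.DataFrame({xcol: [0.0, 1.0], ycol: [0.0, 1.0]})
    completed = pd.concat([df, anchors], ignore_index=True)
    return completed.drop_duplicates(subset=[xcol, ycol]).sort_values(xcol)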
Simulation 4: Complex Experimental Designs¶
Objective: Evaluate method robustness across experimental designs with multiple interacting factors.
Input Data: ./data/Sim4/2_{N}Cond_{Overlap|NonOverlap}_{same|random}Dir_{method}_ResultData.feather
Design Factors:

| Factor | Levels | Description |
|---|---|---|
| Conditions | 2, 3, 4, 5, 6 | Number of experimental conditions |
| Overlap | True, False | Same vs. different peptides perturbed across conditions |
| Direction | same, random | Uniform vs. random perturbation direction |
Total Combinations: 5 conditions × 2 overlap × 2 direction = 20 datasets per method
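As a quick sanity check of the naming pattern, the 60 expected input files can be enumerated directly (a minimal sketch that follows the file pattern above; not a project utility):

from itertools import product

methods = ["COPF", "PeCorA", "ProteoForge"]
overlaps = ["Overlap", "NonOverlap"]
directions = ["same", "random"]
n_conditions = [2, 3, 4, 5, 6]

expected = [
    f"./data/Sim4/2_{n}Cond_{ov}_{d}Dir_{m}_ResultData.feather"
    for ov, d, n, m in product(overlaps, directions, n_conditions, methods)
]
print(len(expected))  # 60 files = 20 datasets x 3 methods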
Key Questions:
- Does adding more conditions improve identification?
- Are overlapping perturbation patterns easier to detect?
- Does perturbation direction (same vs. random across conditions) affect detectability?
Data Processing and Metric Calculation¶
stTime = utils.getTime()
simID = "Sim4"
# Set up a path for the simulation
output_path, figure_path = setup_simulation_paths( simID )
methods = list(method_palette.keys())
# Systematic parameter combinations
overlap_types = [True, False] # Overlap vs Non-overlap
direction_types = ["random", "same"] # Direction of perturbations
# Number of conditions to generate (with shifts)
conditions = {
2: [0, .5],
3: [0, .5, 1],
4: [0, .5, 1, 1.5],
5: [0, .5, 1, 1.5, 2],
6: [0, .5, 1, 1.5, 2, 2.5]
}
print("PEPTIDE IDENTIFICATION BENCHMARK ANALYSIS")
print("-" * 52)
print(f"Simulation ID: {simID}")
print(f"Methods: {', '.join(methods)} ({len(methods)} total)")
print(f"Overlap Types: {', '.join(['Overlap' if ot else 'Non-overlap' for ot in overlap_types])}")
print(f"Direction Types: {', '.join(direction_types)}")
print(f"Conditions (with shifts): {', '.join([str(k) for k in conditions.keys()])}")
print(f"Output Path: {output_path}")
# Progress tracking
total_combinations = len(methods) * len(overlap_types) * len(direction_types) * len(conditions)
current_combination = 0
files_found = 0
files_missing = 0
print(f"Total combinations to process: {total_combinations}")
print("-" * 52)
display("Calculating ID Benchmarks:")
combined_results = []
for overlap in overlap_types:
for direction in direction_types:
for n_conditions, shifts in conditions.items():
for cur_method in methods:
current_combination += 1
overlap_str = "Overlap" if overlap else "NonOverlap"
shifts_str = direction  # direction labels ("random"/"same") are used verbatim in file names
result_file = f"{output_path}2_{n_conditions}Cond_{overlap_str}_{shifts_str}Dir_{cur_method}_ResultData.feather"
if os.path.exists(result_file):
res_df = pd.read_feather(result_file)
files_found += 1
if verbose:
progress = f"[{current_combination:2d}/{total_combinations}]"
print(f"{progress} ✓ {cur_method:12} | {n_conditions} Conditions | {overlap_str} | {shifts_str}")
else:
files_missing += 1
if verbose:
progress = f"[{current_combination:2d}/{total_combinations}]"
print(f"{progress} ✗ {cur_method:12} | {n_conditions} Conditions | {overlap_str} | {shifts_str} (MISSING)")
continue
# Handle COPF-specific column renaming
if cur_method == "COPF":
res_df = res_df.rename(columns={
"proteoform_score_pval": "adj_pval",
'protein_id': "Protein"
})
# Create metric data
metric_data = utils.create_metric_data(
res_df,
pvalue_thresholds=thresholds,
label_col='pertPeptide',
pvalue_col='adj_pval'
)
metric_data['Method'] = cur_method
metric_data['Overlap'] = overlap
metric_data['Direction'] = direction
metric_data['N_Conditions'] = n_conditions
metric_data['Shifts'] = ','.join([str(s) for s in shifts])
combined_results.append(metric_data)
# Combine and process data
idBenchmarkData = pd.concat(combined_results, ignore_index=True)
# Save the processed data
output_file = f"{output_path}4_{simID}_PeptideIdentification_PerformanceData.feather"
idBenchmarkData.to_feather(output_file)
idBenchmarkData.to_csv(output_file.replace('.feather', '.csv'), index=False)
print(f"\nDATA PREVIEW:")
display(idBenchmarkData.tail(5))
print(f"\nRESULTS SUMMARY:")
print("=" * 52)
execTime = utils.prettyTimer(utils.getTime() - stTime)
print(f"Total execution time: {execTime}")
print(f"Files processed successfully: {files_found}")
print(f"Files missing/skipped: {files_missing}")
print(f"Processing success rate: {files_found/(files_found+files_missing)*100:.1f}%")
print("-" * 52)
print(f"Final dataset shape: {idBenchmarkData.shape}")
overlap_types_str = [ "Overlap" if ot else "Non-overlap" for ot in overlap_types ]
print(f"Overlap Types: {len(overlap_types)} -> {', '.join(overlap_types_str)}")
print(f"Direction Types: {len(direction_types)} -> {', '.join(direction_types)}")
print(f"Conditions (with shifts): {len(conditions)} -> {', '.join([str(k) for k in conditions.keys()])}")
methods_list = sorted(idBenchmarkData['Method'].unique())
print(f"Methods analyzed: {idBenchmarkData['Method'].nunique()} -> {', '.join(methods_list)}")
print(f"Threshold levels included: {idBenchmarkData['threshold'].nunique()} (min={idBenchmarkData['threshold'].min()}, max={idBenchmarkData['threshold'].max()})")
print(f"Data saved to: {output_file}")
PEPTIDE IDENTIFICATION BENCHMARK ANALYSIS
----------------------------------------------------
Simulation ID: Sim4
Methods: COPF, PeCorA, ProteoForge (3 total)
Overlap Types: Overlap, Non-overlap
Direction Types: random, same
Conditions (with shifts): 2, 3, 4, 5, 6
Output Path: ./data/Sim4/
Total combinations to process: 60
----------------------------------------------------
'Calculating ID Benchmarks:'
[ 1/60] ✓ COPF         | 2 Conditions | Overlap | random
[ 2/60] ✓ PeCorA       | 2 Conditions | Overlap | random
[ 3/60] ✓ ProteoForge  | 2 Conditions | Overlap | random
[ 4/60] ✓ COPF         | 3 Conditions | Overlap | random
[ 5/60] ✓ PeCorA       | 3 Conditions | Overlap | random
[ 6/60] ✓ ProteoForge  | 3 Conditions | Overlap | random
[ 7/60] ✓ COPF         | 4 Conditions | Overlap | random
[ 8/60] ✓ PeCorA       | 4 Conditions | Overlap | random
[ 9/60] ✓ ProteoForge  | 4 Conditions | Overlap | random
[10/60] ✓ COPF         | 5 Conditions | Overlap | random
[11/60] ✓ PeCorA       | 5 Conditions | Overlap | random
[12/60] ✓ ProteoForge  | 5 Conditions | Overlap | random
[13/60] ✓ COPF         | 6 Conditions | Overlap | random
[14/60] ✓ PeCorA       | 6 Conditions | Overlap | random
[15/60] ✓ ProteoForge  | 6 Conditions | Overlap | random
[16/60] ✓ COPF         | 2 Conditions | Overlap | same
[17/60] ✓ PeCorA       | 2 Conditions | Overlap | same
[18/60] ✓ ProteoForge  | 2 Conditions | Overlap | same
[19/60] ✓ COPF         | 3 Conditions | Overlap | same
[20/60] ✓ PeCorA       | 3 Conditions | Overlap | same
[21/60] ✓ ProteoForge  | 3 Conditions | Overlap | same
[22/60] ✓ COPF         | 4 Conditions | Overlap | same
[23/60] ✓ PeCorA       | 4 Conditions | Overlap | same
[24/60] ✓ ProteoForge  | 4 Conditions | Overlap | same
[25/60] ✓ COPF         | 5 Conditions | Overlap | same
[26/60] ✓ PeCorA       | 5 Conditions | Overlap | same
[27/60] ✓ ProteoForge  | 5 Conditions | Overlap | same
[28/60] ✓ COPF         | 6 Conditions | Overlap | same
[29/60] ✓ PeCorA       | 6 Conditions | Overlap | same
[30/60] ✓ ProteoForge  | 6 Conditions | Overlap | same
[31/60] ✓ COPF         | 2 Conditions | NonOverlap | random
[32/60] ✓ PeCorA       | 2 Conditions | NonOverlap | random
[33/60] ✓ ProteoForge  | 2 Conditions | NonOverlap | random
[34/60] ✓ COPF         | 3 Conditions | NonOverlap | random
[35/60] ✓ PeCorA       | 3 Conditions | NonOverlap | random
[36/60] ✓ ProteoForge  | 3 Conditions | NonOverlap | random
[37/60] ✓ COPF         | 4 Conditions | NonOverlap | random
[38/60] ✓ PeCorA       | 4 Conditions | NonOverlap | random
[39/60] ✓ ProteoForge  | 4 Conditions | NonOverlap | random
[40/60] ✓ COPF         | 5 Conditions | NonOverlap | random
[41/60] ✓ PeCorA       | 5 Conditions | NonOverlap | random
[42/60] ✓ ProteoForge  | 5 Conditions | NonOverlap | random
[43/60] ✓ COPF         | 6 Conditions | NonOverlap | random
[44/60] ✓ PeCorA       | 6 Conditions | NonOverlap | random
[45/60] ✓ ProteoForge  | 6 Conditions | NonOverlap | random
[46/60] ✓ COPF         | 2 Conditions | NonOverlap | same
[47/60] ✓ PeCorA       | 2 Conditions | NonOverlap | same
[48/60] ✓ ProteoForge  | 2 Conditions | NonOverlap | same
[49/60] ✓ COPF         | 3 Conditions | NonOverlap | same
[50/60] ✓ PeCorA       | 3 Conditions | NonOverlap | same
[51/60] ✓ ProteoForge  | 3 Conditions | NonOverlap | same
[52/60] ✓ COPF         | 4 Conditions | NonOverlap | same
[53/60] ✓ PeCorA       | 4 Conditions | NonOverlap | same
[54/60] ✓ ProteoForge  | 4 Conditions | NonOverlap | same
[55/60] ✓ COPF         | 5 Conditions | NonOverlap | same
[56/60] ✓ PeCorA       | 5 Conditions | NonOverlap | same
[57/60] ✓ ProteoForge  | 5 Conditions | NonOverlap | same
[58/60] ✓ COPF         | 6 Conditions | NonOverlap | same
[59/60] ✓ PeCorA       | 6 Conditions | NonOverlap | same
[60/60] ✓ ProteoForge  | 6 Conditions | NonOverlap | same

DATA PREVIEW:
| | TP | FP | TN | FN | TPR | FPR | FDR | MCC | Precision | Recall | F1 | threshold | Method | Overlap | Direction | N_Conditions | Shifts |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1495 | 847 | 650 | 4118 | 68 | 0.9257 | 0.1363 | 0.4342 | 0.6586 | 0.5658 | 0.9257 | 0.7023 | 0.6000 | ProteoForge | False | same | 6 | 0,0.5,1,1.5,2,2.5 |
| 1496 | 849 | 707 | 4061 | 66 | 0.9279 | 0.1483 | 0.4544 | 0.6426 | 0.5456 | 0.9279 | 0.6872 | 0.7000 | ProteoForge | False | same | 6 | 0,0.5,1,1.5,2,2.5 |
| 1497 | 852 | 772 | 3996 | 63 | 0.9311 | 0.1619 | 0.4754 | 0.6258 | 0.5246 | 0.9311 | 0.6711 | 0.8000 | ProteoForge | False | same | 6 | 0,0.5,1,1.5,2,2.5 |
| 1498 | 855 | 818 | 3950 | 60 | 0.9344 | 0.1716 | 0.4889 | 0.6152 | 0.5111 | 0.9344 | 0.6607 | 0.9000 | ProteoForge | False | same | 6 | 0,0.5,1,1.5,2,2.5 |
| 1499 | 915 | 4768 | 0 | 0 | 1.0000 | 1.0000 | 0.8390 | 0.0000 | 0.1610 | 1.0000 | 0.2774 | 1.0000 | ProteoForge | False | same | 6 | 0,0.5,1,1.5,2,2.5 |
RESULTS SUMMARY:
====================================================
Total execution time: 00h:00m:07s
Files processed successfully: 60
Files missing/skipped: 0
Processing success rate: 100.0%
----------------------------------------------------
Final dataset shape: (1500, 17)
Overlap Types: 2 -> Overlap, Non-overlap
Direction Types: 2 -> random, same
Conditions (with shifts): 5 -> 2, 3, 4, 5, 6
Methods analyzed: 3 -> COPF, PeCorA, ProteoForge
Threshold levels included: 25 (min=0.0, max=1.0)
Data saved to: ./data/Sim4/4_Sim4_PeptideIdentification_PerformanceData.feather
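utils.create_metric_data is provided by the project's src utilities and is not reproduced here; the columns in the preview are consistent with per-threshold confusion-matrix metrics. A hypothetical reimplementation for orientation (the function name is ours, and edge-case conventions such as MCC = 0 when a marginal sum is zero are inferred from the preview rows; the real helper may differ):

import numpy as np
import pandas as pd

def create_metric_data_sketch(df, pvalue_thresholds,
                              label_col="pertPeptide", pvalue_col="adj_pval"):
    """Hypothetical stand-in for utils.create_metric_data: confusion
    counts and derived metrics at each p-value threshold."""
    y = df[label_col].astype(bool).to_numpy()  # ground-truth perturbation labels
    p = df[pvalue_col].to_numpy()
    rows = []
    for thr in pvalue_thresholds:
        pred = p <= thr  # call a peptide "perturbed" if its adjusted p-value passes
        tp = int(np.sum(pred & y)); fp = int(np.sum(pred & ~y))
        tn = int(np.sum(~pred & ~y)); fn = int(np.sum(~pred & y))
        tpr = tp / (tp + fn) if tp + fn else 0.0
        fpr = fp / (fp + tn) if fp + tn else 0.0
        prec = tp / (tp + fp) if tp + fp else 0.0
        denom = np.sqrt(float(tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
        mcc = (tp * tn - fp * fn) / denom if denom else 0.0
        f1 = 2 * prec * tpr / (prec + tpr) if prec + tpr else 0.0
        rows.append(dict(TP=tp, FP=fp, TN=tn, FN=fn, TPR=tpr, FPR=fpr,
                         FDR=1.0 - prec if tp + fp else 0.0, MCC=mcc,
                         Precision=prec, Recall=tpr, F1=f1, threshold=thr))
    return pd.DataFrame(rows)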
Comprehensive Barplot: Factor Interactions¶
Four-panel barplot showing MCC across number of conditions, stratified by overlap type (rows) and perturbation direction (columns). Error bars show 95% CI. Scatter points indicate MCC at the p-value threshold. Values below bars show mean MCC scores.
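The scatter overlays and per-bar labels below need the x position of each dodged bar: seaborn centres each group of bars in a total width of 0.8, so method k in condition group j sits at j + (k - (n_methods - 1) / 2) * bar_width. A quick check of that arithmetic (standalone sketch, mirroring the positioning used in the cell below):

n_methods = 3
bar_width = 0.8 / n_methods  # width of one dodged bar
offsets = [(k - (n_methods - 1) / 2) * bar_width for k in range(n_methods)]
print(offsets)  # approx [-0.267, 0.0, 0.267] relative to each group centre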
# Comprehensive bar plot with all perturbation ranges and statistical details
fig, axes = plt.subplots(
2, 2, figsize=(16, 10),
sharey=True, sharex=True,
gridspec_kw={
"hspace": 0.01, # Reduced vertical spacing
"wspace": 0.01 # Minimal horizontal spacing
}
)
plot_combinations = [
(True, "same"),
(True, "random"),
(False, "same"),
(False, "random"),
]
for i, (overlap, direction) in enumerate(plot_combinations):
cur_data = idBenchmarkData[
(idBenchmarkData["Overlap"] == overlap) &
(idBenchmarkData["Direction"] == direction)
]
ax = axes[i//2, i%2]
# Get unique conditions and methods for proper positioning
unique_conditions = sorted(cur_data['N_Conditions'].unique())
methods = list(method_palette.keys())
n_methods = len(methods)
# Calculate bar width and positioning (seaborn default spacing)
bar_width = 0.8 / n_methods # Total width of 0.8 divided by number of methods
sns.barplot(
ax=ax,
data=cur_data,
x="N_Conditions",
y="MCC",
hue="Method",
palette=method_palette,
edgecolor='black',
errwidth=1.5,
capsize=0.1,
dodge=True,
estimator='mean',
errorbar=('ci', 95),
legend=False # Disable individual legends
)
# Set direction as title for each column
if i // 2 == 0: # Top row
direction_title = f"{direction.capitalize()} Direction"
ax.set_title(direction_title, fontsize=14, fontweight="bold")
# Add overlap/non-overlap labels on the right side
if i % 2 == 1: # Right column
overlap_label = "Overlap" if overlap else "Non-overlap"
ax.text(
1.05, 0.5,
overlap_label,
transform=ax.transAxes,
rotation=270,
va="center",
ha="left",
fontsize=14,
fontweight="bold"
)
# Configure axis labels
if i // 2 == 1: # Bottom row - show x-axis labels
ax.set_xlabel("Number of Conditions", fontsize=12, fontweight="bold")
else: # Top row - hide x-axis labels
ax.set_xlabel("")
# ax.set_xticklabels([])
if i % 2 == 0: # Left column - show y-axis labels
ax.set_ylabel("Mean MCC Score", fontsize=12, fontweight="bold")
else: # Right column - hide y-axis labels
ax.set_ylabel("")
# Styling
ax.set_ylim(-0.15, 1.0)
ax.grid("y", linestyle="--", linewidth=0.75, alpha=0.5, color="lightgrey")
# Write the average MCC score for each method at the bottom of each bar group
for j, condition in enumerate(unique_conditions):
for k, method in enumerate(methods):
avg_score = cur_data[
(cur_data["N_Conditions"] == condition) &
(cur_data["Method"] == method)
]["MCC"].mean()
# Calculate the x position for this method's bar
x_pos = j + (k - (n_methods - 1) / 2) * bar_width
ax.text(
x_pos,
-0.01, # Below the x-axis
f"{avg_score:.2f}",
color=method_palette[method],
ha="center",
va="top",
fontsize=10,
fontweight="bold",
rotation=90,
clip_on=False,
)
# Draw MCC interpretation thresholds
for thresh, label in mcc_thresholds.items():
ax.axhline(
thresh, color=mcc_colors[label], alpha=1,
linestyle="dotted", linewidth=1.5,
label=label, zorder=0
)
# Only draw text for first column to avoid clutter
if i % 2 == 1:
continue
# Use mixed transform: relative x, data y
ax.text(
0.01,
thresh,
label,
color=mcc_colors[label],
ha="left",
va="center", # Center on the line
fontsize=10,
fontweight="bold",
transform=ax.get_yaxis_transform(), # x in axes coords, y in data coords
bbox=dict(boxstyle='round,pad=0.3', facecolor='white', edgecolor=mcc_colors[label], alpha=0.8),
zorder=0
)
# Add scatter points for MCC values at p-value threshold
pthr_data = cur_data[cur_data['threshold'] == pthr]
for j, method in enumerate(methods):
method_data = pthr_data[pthr_data['Method'] == method]
x_positions = []
y_positions = []
for condition in unique_conditions:
condition_data = method_data[method_data['N_Conditions'] == condition]
if not condition_data.empty:
# Calculate x position: condition index + offset for this method
condition_idx = unique_conditions.index(condition)
x_offset = (j - (n_methods - 1) / 2) * bar_width
x_pos = condition_idx + x_offset
x_positions.append(x_pos)
y_positions.append(condition_data['MCC'].iloc[0])
# Add scatter points for this method
if x_positions: # Only scatter if we have data
ax.scatter(
x_positions,
y_positions,
color=method_palette[method],
s=75,
edgecolor='black',
linewidth=1.25,
marker=method_markers[method],
zorder=10,
alpha=0.9
)
# # Add boxed text for each point
# for (x, y) in zip(x_positions, y_positions):
# ax.text(
# x,
# y - 0.05, # Slightly below the point
# f'{y:.2f}',
# ha="center",
# va="top",
# fontsize=9,
# fontweight="bold",
# color="black",
# bbox=dict(boxstyle='round,pad=0.3', facecolor='white', edgecolor='black', alpha=0.8),
# zorder=10
# )
# Create a legend for methods using marker and color
from matplotlib.lines import Line2D
method_legend_elements = [
Line2D(
[0], [0],
marker=method_markers[m],
color='w',
markerfacecolor=method_palette[m],
markeredgecolor='black',
markersize=12,
linewidth=0,
label=m
)
for m in methods
]
fig.legend(
handles=method_legend_elements,
loc='lower center',
ncol=len(methods),
fontsize=12,
title="Method",
title_fontsize=14,
frameon=False,
bbox_to_anchor=(0.5, 0.05)
)
# Adjust layout to accommodate legend
plt.tight_layout()
plt.subplots_adjust(bottom=0.15)
plots.finalize_plot(
fig,
show=True,
save=save_to_folder,
filename=f"{simID}_PepIDBenchmark_MCC_Conditions_Barplot",
filepath=figure_path,
formats=figure_formats,
transparent=transparent_bg,
dpi=figure_dpi,
)
Figure saved to: ./figures/Sim4//pdf/Sim4_PepIDBenchmark_MCC_Conditions_Barplot.pdf
Summary¶
Output Files Generated¶
| Simulation | File | Description |
|---|---|---|
| Sim1 | `figures/roc_curves.png` | ROC curves across imputation strategies |
| Sim1 | `figures/pr_curves.png` | Precision-Recall curves across imputation |
| Sim1 | `figures/mcc_curves.png` | MCC vs. p-value threshold curves |
| Sim1 | `figures/mcc_barplots.png` | MCC comparison barplots |
| Sim2 | `figures/auroc_heatmap.png` | AUROC heatmap across missingness levels |
| Sim2 | `figures/mean_mcc_heatmap.png` | Mean MCC across all thresholds |
| Sim2 | `figures/mcc_at_threshold.png` | MCC at optimal p-value threshold |
| Sim2 | `figures/max_mcc_heatmap.png` | Maximum achievable MCC |
| Sim3 | `figures/Sim3/pdf/Sim3_PepIDBenchmark_MCC_Curve_Perturbation.pdf` | MCC sensitivity to magnitude |
| Sim3 | `figures/Sim3/pdf/Sim3_PepIDBenchmark_AUC_Heatmap_Compact.pdf` | AUROC heatmap across magnitudes |
| Sim4 | `figures/Sim4/pdf/Sim4_PepIDBenchmark_MCC_Conditions_Barplot.pdf` | MCC across experimental conditions |
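To verify the generated figures on disk, a quick listing (a sketch; assumes the `figures/Sim{N}/pdf/` layout used by plots.finalize_plot above):

import glob

# List every benchmark figure written during this notebook run
for pdf in sorted(glob.glob("./figures/Sim*/pdf/*.pdf")):
    print(pdf)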
Key Findings by Simulation¶
- Sim1 (Imputation): Establishes baseline performance under controlled 30% MNAR conditions
- Sim2 (Missingness): Reveals method robustness across 0-70% missingness spectrum
- Sim3 (Magnitude): Tests sensitivity to perturbation magnitude across the 0.10-2.00 range
- Sim4 (Conditions): Evaluates scalability with experimental complexity (2-6 conditions)
print("Notebook Execution Time:", utils.prettyTimer(utils.getTime() - startTime))
Notebook Execution Time: 00h:01m:05s